THE NATURE AND EDUCATIONAL SIGNIFICANCE
OF PHYSICAL STATUS AND OF MENTAL, 
PHYSIOLOGICAL, SOCIAL AND 
EMOTIONAL MATURITY 


ARTHUR I. GATES 


Assisted by Grace A. Taylor, Eloise Boeker, and Dorothy Van Alstyne 
Teachers College, Columbia University 


For centuries the relation of mental and physical abilities has been 
a topic of speculation and observation, for more than 30 years a subject 
of experimental inquiry, yet up to the present day, the nature of these 
relations has not been demonstrated in such a way as to leave but one 
interpretation unquestionable. As a result, current scientific litera- 
ture abounds with theories and practical suggestions that are, in whole 
or part, conflicting. 

In another paper, a brief account of several theories has been given 
together with a summary of some 30 studies representative of the con- 
tributions to the subject which have been made since 1893. In the 
present article, aside from a few introductory remarks to portray the 
setting, the results of an experimental study conducted during the 
year 1922-1923 in the Horace Mann School, will be presented. 

Concerning the factors which determine the degree and merit of 
achievement, several points of view are now apparent. 

1. Some authorities hold that native capacity plus educational 
opportunities and management are the main determiners of achieve- 
ment. 

2. Some hold that native capacity, plus optimum educational 
conditions are important—sometimes the main, but not the sole 
factors. Many other traits—physical, emotional, temperamental, 
etc.—may contribute independently in various degrees. Among the 
other factors, many different ones have received main emphasis.
Recently there has been, among some writers, a tendency greatly to 
emphasize the influence of maturities of several types. 

To some, maturity appears to be a general factor expressed in 
various degrees by different measurements. Thus, Naccarati and
Lewy-Guinzburg write:¹ “The same hormones which promote the
morphogenesis of the skeleton and muscles of the limbs, promote also
the development of the psychomotor and psychosensory centers.”
Dearborn² states that “. . . we are dealing with an individual organism
whose changing behavior (as noted by the mental tests) is only one
further indication of its development.” He suggests the need of
“a better measure than hitherto available of the relative development
of the individual organism as a whole.”

Others, with less disposition to generalize explanations, are nevertheless
convinced of the dynamic potency of certain forms of maturity,
among which “physical,” “anatomical,” or “physiological” maturity
are mainly emphasized. Thus Baldwin, foremost investigator of
physical growth, writes:³ “Physiological age is, the writer believes,
directly correlated with stages of mental maturation . . . The
physiologically more mature child has different attitudes, different
types of emotions, different interests, from the child who is physically
younger though of the same chronological age. . . . Physiological
age has a direct bearing on pedagogical age, as many of our schools
are beginning to recognize. The larger and physiologically more
mature child may be able to do certain types of school work better,
although of inferior ability in specific traits which have been greatly
emphasized by school curricula. . . . That there is a direct relationship
between social age and physiological maturity needs only to
be mentioned to be evident.”

As a consequence of such convictions, Baldwin has recommended 
in the Twenty-third Yearbook of the National Society for the Study of 
Education, that “physiological age” based on “height and general 
growth, development of the carpal bones” (and, where they may be 
used, evidences of stages of pubescence) be given a heavy, but not a 





1 Naccarati, S. and Lewy-Guinzburg, R. L.: “Hormones and Intelligence.”
Journal of Applied Psychology, 1922, pp. 221-234.

2 Dearborn, W. F.: “Some Problems of Research in Education.” School
and Society, June 23, 1924, pp. 675-676.

3 Baldwin, B. T.: “The Physical Growth of Children from Birth to Maturity.”
University of Iowa Studies, Vol. I, No. 1, 1921.
definitely stated, weight in the classification and promotion of pupils.
To test the validity of this procedure and certain related hypotheses, 
an analysis has been undertaken of data gathered during the preceding 
year in the Horace Mann School. 


PLAN OF THE PRESENT INVESTIGATION 


The investigation was designed originally to afford an analysis of 
the interrelations not only of physical and mental abilities, but also 
of maturity—physical, mental, educational, social, and emotional. 
So far as possible objective tests were used; but for appraisals of several 
types of maturity, it was necessary to rely mainly upon personal
judgments guided by rating scales. The ratings were repeated and so 
combined with objective tests that it was possible to apply several 
checks upon their reliability and validity. 

Between an intensive appraisal of a smaller number of pupils
and a less thorough study of larger numbers, the former alternative was
chosen. The study was designed to yield information on the question
as to the extent that measures of physical traits and of the several 
types of maturity may be practically useful in diagnosing the achieve- 
ments of a pupil; in deciding upon the amount of work that may wisely 
be undertaken, and upon the degree to which these traits should be 
taken into account in classification and promotion. 

Two groups, one consisting of 58 pupils of the junior primary divi- 
sion and another of 57 pupils of Grade IV, were selected. Treated 
independently, the results of one will serve to check those of the other. 
An analysis of these groups shows the following data: 


                        JUNIOR PRIMARY GROUP

                   NUMBER   AVERAGE    AD     AVERAGE   AVERAGE
                              AGE                MA     DEVIATION
Boys ............    30       5.66    0.63      7.38      0.96
Girls ...........    28       5.72    0.67      7.70      0.96

                           GRADE IV GROUP

Boys ............    27       9.66    0.56     11.75      0.75
Girls ...........    30       9.50    0.59     11.68      0.91

Within each group, the two sexes show a very similar average age 
and mental age and a similar variability in each of these traits. 

The measures may be divided, not without considerable over- 
lapping however, into six groups. 
1. Anatomical traits: Ossification of the wrist bones, height, weight,
chest girth. 

2. Physical or physiological functions: Lung capacity, strength of 
forearm, index of nutrition, rate of heart beat, estimates of physical 
vigor, health and efficiency. 

3. Emotional maturity and stability, estimated by use of rating 
scales. 

4. Social maturity, estimated. 

5. Mental ability and maturity: Stanford-Binet Mental Age, 
mental age corrected by teachers’ judgments, mental maturity 
estimated by means of rating scale. 

6. Scholastic achievement: For Grade IV, battery of six tests from 
the Stanford Achievement Tests plus Horace Mann Tests in language 
and spelling. Also ‘‘educational maturity” estimated by teachers. 
For junior primary, teachers’ records of achievement in knowledge and 
skill plus teachers’ estimates of educational maturity on a special scale.

For those interested in the details of technique the following brief 
descriptions of tests, not however given in exactly the order listed 
above, will perhaps suffice: 


PHYSICAL MEASUREMENTS


1. Ossification of the Wrist Bones.—The Roentgen-ray photographs 
of both wrists were taken at the exact natural size by the Department of 
Roentgenology of the Vanderbilt Clinic associated with the College 
of Physicians and Surgeons of Columbia University. The area of 
each of the eight (or fewer) bones in each wrist was measured by means 
of a planimeter, a mechanical device which gives with accuracy the 
area of any irregular figure. Two readings were taken of each area. 
Where discrepancies were found further readings were made so that the 
error of measurement is negligible. The areas of all of the bones in
both wrists were summated to yield the total amount of ossification.

(A) The total area of ossification is determined not only by the 
degree to which the process of ossification has advanced but also by 
the size of the individual’s skeleton, presumably. A better index of the 
advancement of ossification, of maturity in this respect, conceivably 
may be the proportion of the area of ossification of maturity which is 
now ossified. This ratio we attempted to approximate in a valid and 
practicable manner. Points were placed on the extreme outer corners 
(points easily and reliably ascertained) of the first metacarpal, fifth 
metacarpal, the ulna and radius (the styloid process, which is separated
from the main shaft at these ages, was disregarded). Connecting these
four points yields a quadrilateral which represents an area fairly 
uniformly indicative of the size of the skeleton at the carpal region. 
In all cases this area is larger than the ossified portions contained within 
it. The area of the quadrilateral for each hand could be computed by 
dividing it into two triangles. 

(B) The percentage of ossification was obtained by dividing the 
area of ossification, both hands combined, by the total wrist areas, of 
both hands, obtained as described. These percentages for the junior 
primary group ranged from 0.11 to 0.44, mean 0.25; for the Grade IV 
group, from 0.41 to 0.72, mean 0.55. The gross total areas, both 
hands combined in square inches, were as follows: Junior primary boys, 
mean 0.31234, AD 0.105; for the girls, 0.4506, AD 0.125; Grade IV 
boys, mean 2.30, AD 0.253; girls, mean 2.435, AD 0.2866. Both in gross
area and in the ratio of ossification the girls, as has been consistently 
found heretofore, surpass the boys. In all correlations the sexes are 
therefore treated separately. 
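
The geometry described in (A) and (B) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corner coordinates and ossified areas are invented, not measurements from the study, and the function names are hypothetical.

```python
def triangle_area(p, q, r):
    """Area of the triangle with (x, y) vertices p, q, r."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return abs(cross) / 2.0


def quadrilateral_area(a, b, c, d):
    """Area of a convex quadrilateral a-b-c-d (corners in order),
    split along the diagonal a-c into two triangles."""
    return triangle_area(a, b, c) + triangle_area(a, c, d)


def ossification_ratio(ossified_areas, wrist_areas):
    """Ossified area of both hands divided by wrist area of both hands."""
    return sum(ossified_areas) / sum(wrist_areas)


# Invented corner points (inches) for one hand; not data from the study.
corners = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.5), (0.0, 1.5)]
one_wrist = quadrilateral_area(*corners)

# Invented ossified areas for the two hands; the same wrist area is assumed for both.
ratio = ossification_ratio([0.8, 0.7], [one_wrist, one_wrist])
print(one_wrist, round(ratio, 2))
```

Summing both hands before dividing, rather than averaging two per-hand ratios, follows the wording of (B).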

1.¹ Height: Instrument used: Stadiometer. Measured in terms of
inches and tenths.

2. Weight: Instrument used: Buffalo Scale. Strict weight (weight
of robe subtracted from gross weight). Measured in terms of pounds
and tenths.

3. Nutrition: Actual weight subtracted from the norm, as found on
the height and weight table prepared by Dr. Thomas D. Wood;
extension of norm was made for those below five years.

4. Chest Girth: Instrument used: Standard Spring Tape. Made
when all air was expired. Measurements made in tenths of an inch.

5. Lung Capacity: Instrument used: Standard Wet Spirometer.
Trials given: 3 or more to new children; once to those who had the
“knack.”

6. Heart Rate: Measurements made by Dr. P. M. Stimson. Instrument
used: Stethoscope. Measurement given in terms of rate per
minute.

7. Strength of Right and Left Forearm: Instrument used: Hand Dynamometer.
Trials given: 3 or more to new children; once to those who had the
“knack.”





1 Measurements 2 to 8 inclusive were made by Dr. P. M. Stimson, assisted by 
Mrs. J. R. McCastline, Miss Maud Marsh and Miss Marie King. 
TESTS OF MENTAL AND EDUCATIONAL ABILITIES


1. Stanford-Binet Mental Age for general mental ability.
2. For Educational Achievement: A composite line based on the
following:

(A) The following tests from the Stanford Achievement Tests:
    1. Reading, paragraph meaning, 20 minutes
    2. Reading, sentence meaning, 10 minutes
    3. Reading, word meaning, 10 minutes
    4. Arithmetic, computation, 20 minutes
    5. Arithmetic, reasoning, 20 minutes

(B) Tests devised to meet local content:
    1. Spelling, 50 words from Pearson Speller.
    2. Language Usage Test.

The tests were not all given at the same time, but as follows: 

1. Achievement tests at the end of the school year in May. 

2. Physical tests at beginning of year except X-rays, about 
mid-year. 

3. Mental tests, beginning of year or based on IQ’s given 
the preceding year. Thus the correlations with achieve- 
ment are really predictive. 

4. Estimates of physical fitness, social maturity etc. made 
at end of year but based on observations during the full 
year. 


THE RATING SCALES


The rating scales were based in the main on concepts intelligible to 
the teachers who used them. Although the same criteria were used 
for both groups, the exact sample situations and illustrations were 
modified to suit each group. The following, which were adapted to
the Grade IV group, are in essence like those used for the younger
children. A general rating, with 1 as the lowest and 10 as the highest 
score, was made on each of the following topics. In the following 
descriptions of qualities rated much of the detailed illustrative matter 
in the scales used by the teachers has been omitted. 

(A) Physical Efficiency and Vigor: Including (1) general health 
(2) physical energy shown in work and play (3) stamina. 

(B) Mental Maturity: Including common sense, understanding, 
critical attitude, initiative, perseverance and responsibility in various 
mental activities. 
(C) Social Maturity: Including (1) responsibility for his own acts,
property, hygiene; (2) cooperation; (3) respect for law and order and 
(4) leadership. 

(D) Scholastic Maturity: (1) Intellectual curiosity, desire to work; 
(2) habits of work; (3) persistence in work; (4) quality of work. 

(E) Emotional Maturity: (1) Absence of excessive emotionality; 
(2) proper emotional responsiveness; (3) evenness of responsiveness; 
(4) degree of development, maturity vs. babyishness.


RELIABILITY AND VALIDITY OF RATINGS 


The ratings were made by the teachers after conference with the 
investigators following a study of the scales. They were done with 
great care by teachers whose experiences in the use of scales and in 
appraising pupils for such traits are unusually wide inasmuch as 
estimates by other devices have long been a part of the regular work of
classification and promotion in the Horace Mann School. 

The results of the ratings for both groups have been carefully scru- 
tinized. The reliability of the judgments may be gauged from the 
following analysis of the results of the junior primary groups. 

Five judges rated all of the children on all of the traits: physical
fitness, mental, scholastic, social and emotional maturity. The
reliability of the ratings is suggested by the correlations of the estimates
of each judge with each other judge. For each trait 10 such correlations
could be secured. It is possible, however, by means of a
formula given by Kelley¹ to compute more expeditiously the average
intercorrelation, which is sufficient for our purposes. These averages
are given below.


Physical fitness ................ 0.59
[trait name illegible] .......... 0.57
[trait name illegible] .......... 0.56
[trait name illegible] .......... 0.56
Emotional maturity .............. 0.55

The agreement is approximately the same for the several traits
but in all cases, as almost invariably found, judgments of human traits,
even by the most competent, are by no means infallible. By combining
the estimates of all judges, a closer approximation to true measures
will be secured. Knowing the degree of disagreement between an





1 Kelley, T. L.: “Statistical Method.” Sec. 61.
average pair of judges, it is possible to estimate the reliability of a
judgment based on the independent appraisals of five judges.¹
Thus the combined, equally weighted, estimates of our five judges would
correlate with a similar combination of ratings of five other equally
good judges to the extents: from r = 0.88, for physical fitness, to r =
0.86, for emotional maturity. That is, the reliability (not to be confused
with validity) of our five judgments combined is about equal to
the reliability of a short reading or arithmetic test; somewhat less
than that of the Stanford-Binet, two forms of which would correlate
about 0.92 or higher with such groups.
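
The step from the average intercorrelation of a pair of judges to the reliability of five judges combined can be checked with a short sketch. It assumes that “Brown’s formula,” cited in the footnote, is what is now usually called the Spearman-Brown formula; the function name is hypothetical.

```python
def spearman_brown(r_pair, k):
    """Reliability of k pooled, equally weighted judges, given the
    average correlation r_pair between any two single judges."""
    return k * r_pair / (1.0 + (k - 1) * r_pair)


# Highest and lowest average intercorrelations reported for the five judges.
physical_fitness = spearman_brown(0.59, 5)
emotional_maturity = spearman_brown(0.55, 5)
print(round(physical_fitness, 2), round(emotional_maturity, 2))  # 0.88 0.86
```

Under that assumption the two reported extremes, 0.88 and 0.86, follow directly from the average intercorrelations 0.59 and 0.55.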

While the teachers agreed with each other fairly well, in comparison 
with the average judge of human traits, it is not as yet demonstrated 
that they were agreeing on the right traits. Whether they were really 
judging specifically and validly physical efficiency and vigor, social, 
mental, emotional and scholastic maturity is not easily ascertained. 

A survey of the correlations of the estimated mental and educational 
maturity with the Binet Mental Age and the educational age determined 
by the tests of reading, arithmetic, spelling, etc., may throw some light 
on the meaning of the estimated traits. First, in Table I are given the 
correlations, separately for boys and girls, of educational achievement 
with MA and with mental and educational maturity for Grade IV 
since only in this group were satisfactory measurements of scholastic 
achievement obtained. 


TABLE I. CORRELATIONS OF EDUCATIONAL ACHIEVEMENT WITH MA, WITH ESTIMATED
MENTAL MATURITY AND SCHOLASTIC MATURITY FOR THE
GRADE IV BOYS (4B) AND GIRLS (4G)


                         MENTAL      EDUCATIONAL
               MA        MATURITY    MATURITY
4B ........   .56          .61          .70
4G ........   .63          .55          .78
Average ...   .595         .58          .74


Educational achievement, objectively measured, corresponds more 
closely to educational maturity than to either the Binet Mental Age 
or estimated mental maturity. Educational achievement and esti- 
mated educational maturity are not identical however. The latter 
may include—although there is here no evidence to that effect—certain 
features of scholastic attitude and maturity not fully covered by the 
tests. 


1 Kelley, T. L.: By means of “Brown’s formula.” “Statistical Method,” p. 205.
The intercorrelations of MA, estimated mental maturity and educational
maturity appear in Table II for boys and girls of both the
Grade IV and the Kindergarten groups.


TABLE II. INTERCORRELATIONS OF MA, MENTAL MATURITY AND EDUCATIONAL
MATURITY FOR KINDERGARTEN BOYS (KB), KINDERGARTEN GIRLS
(KG), AND GRADE IV BOYS (4B) AND GIRLS (4G)

               r MENTAL AGE,    r MENTAL AGE,    r MENTAL MATURITY,
               MENTAL           EDUCATIONAL      EDUCATIONAL
               MATURITY         MATURITY         MATURITY
KB ........      .46              .50              .57
KG ........      .52              .56              .68
4B ........      .65              .59              .77
4G ........      .60              .52              .87
Average ...      .56              .54              .70


Mental age (Binet) and estimated mental maturity are by no means
identical, nor are mental age and estimated educational maturity. Between
the two maturities as estimated there is, at least in Grade IV, a higher 
correlation but in general a distinction between them is suggested. 

No objective tests, as yet thoroughly known, were available by 
means of which the judgments of social and emotional maturity could 
be checked. These two traits are perhaps judged to be more alike 
than they really are. G. S. Gates¹ used these ratings with part of the
Kindergarten groups, which had taken her “social perception test.”
The test consists of identifying the emotional expressions portrayed
in the Ruckmick photographs. She found substantial correlations of
social perception test with social and with emotional maturity. She 
suggests that the child’s social perception is a factor which influences a 
teacher’s estimates of both types of maturity. 

Such checks as are available, then, seem to indicate that the 
teachers’ estimates of the several types of maturity are of considerable 
validity. The fact that the estimated traits do agree in showing a 
reasonable relation to the objective tests gives a reason for believing 
that the former were conscientiously and probably well done and that 
they represent certain aspects of mental poise, common sense, maturity 
of mind and general educational maturity not exactly portrayed by 
the objective tests. It will, therefore, be interesting later to discover
with which of these the other variables are most closely correlated.





1 Gates, G. S.: “An Experimental Study of the Growth of Social Perception.”
Journal of Educational Psychology, Nov., 1923, and another article to be published
later.
TREATMENT OF THE DATA


The data for the two sexes in both groups (Kindergarten and
Grade IV) were treated separately. There were, consequently, four
complete series of intercorrelations. The influence of variations in age, 
which were not great for either group, was, nevertheless, apparent. 
In the Kindergarten group mental and educational abilities were 
correlated positively and in Grade IV, negatively, with age. In order 
to make all results comparable as well as to eliminate the spurious 
effects due to joint association of any two traits with age, this factor 
was eliminated by the technique of partial correlation. For the first 
simple correlations the Pearson Formula was used and for partial 
correlation the usual formula. 
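
The elimination of age can be illustrated with the standard first-order partial correlation formula, which is presumably the “usual formula” meant here. The sketch below uses invented coefficients, not values from the Appendix, and a hypothetical function name.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z (here, age) held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


# Invented example: two traits correlate 0.50 with each other and
# 0.30 and 0.40 with age.
r_age_removed = partial_correlation(0.50, 0.30, 0.40)
print(round(r_age_removed, 3))
```

When both traits rise with age, as in this invented case, removing age lowers the coefficient; when one trait rises and the other falls with age, removing age raises it.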

Between each pair of variables, then, were four correlations (age
eliminated) for which the average was obtained. All of these data 
appear in Appendix I. 

In the discussion which follows, the average correlations only will 
be mentioned—except occasionally. The student interested in sex 
or age differences—which we are unable to discover—may consult the 
details in the Appendix. 

In computing the multiple correlations the method by Toops² has
been utilized, by means of which the correlations between a criterion
and a number of measures are combined in such a way as to give the
highest coefficient.


THE INTERCORRELATIONS OF PHYSICAL TRAITS 


The average intercorrelations of the physical traits are displayed
in Table III. With the exception of heart rate, all are positive but of
various magnitudes. Of the several traits, weight gives the highest
average intercorrelation with others, namely 0.555; chest girth is next
with 0.51, height with 0.474, the area of ossified wrist bones 0.41, lung
capacity 0.39, strength of grip 0.33, and the index of nutrition 0.29.
These correlations are much like those previously found by other
investigators,³ indicating that our groups, in these respects at least,
are fairly representative and our data reasonably valid and reliable.





1 Yule: “Statistics.”

2 Toops, H. A.: Tests for Vocational Guidance of Children. Teachers
College Contributions, 1923, No. 136.

3 See, for example, Baldwin, B. T.: “The Physical Growth of Children from
Birth to Maturity.” University of Iowa Studies, No. 1, p. 117ff.
The slight negative correlations of most of the measures of bodily size
with heart rate likewise indicate the representative character of the
data, since such correlations have been previously found.¹
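
The average intercorrelation quoted for weight can be reproduced from the weight row of Table III. The sketch below assumes, following the table's footnote for that mean, that nutrition and heart rate are left out; the variable names are hypothetical.

```python
# Correlations of weight with the other physical traits, as read from Table III,
# omitting nutrition and heart rate.
weight_row = {
    "area of ossification": 0.62,
    "ratio of ossification": 0.58,
    "height": 0.69,
    "chest girth": 0.65,
    "lung capacity": 0.39,
    "strength of grip": 0.40,
}

mean_r = sum(weight_row.values()) / len(weight_row)
print(round(mean_r, 3))  # 0.555
```

The nutrition index is itself computed from weight, which is presumably why its correlation of 0.83 with weight is excluded from this mean.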


TABLE III. INTERCORRELATIONS OF PHYSICAL TRAITS. EACH r IS THE AVERAGE
OF FOUR WHICH APPEAR IN THE APPENDIX

                          Area of  Ratio of                   Chest    Lung     Strength Nutri-   Heart
                          ossifi-  ossifi-  Height   Weight   girth    capacity of grip  tion     rate
                          cation   cation
Area of ossification ....  ...      .88      .60      .62      .43      .31      .25      .26     -.09
Ratio of ossification ...  .88      ...      .52      .58      .41      .21      .24      .29     -.05
Height ..................  .60      .52      ...      .69      .44      .51      .45      .11     -.02
Weight ..................  .62      .58      .69      ...      .65      .39      .40      .83     -.24
Chest girth .............  .43      .41      .44      .65      ...      .59      .36      .69     -.06
Lung capacity ...........  .31      .21      .51      .39      .59      ...      .46      .26     -.01
Strength of grip ........  .25      .24      .45      .40      .36      .46      ...      .14     -.06
Nutrition ...............  .26      .29      .11      .83      .69      .26      .14      ...      .09
Heart rate .............. -.09     -.05     -.02     -.24     -.06     -.01     -.06      .09      ...
Mean ....................  .41¹    [illeg.]² .474³    .555⁴    .518     .393     .334     .295⁵   -.055

1 Excluding ratio of ossification and heart rate.
2 Excluding area of ossification and heart rate.
3 Excluding heart rate.
4 Excluding nutrition and heart rate.
5 Excluding weight and heart rate.


If the physical measurements are conceived as indicating various 
stages of development or growth, it is apparent that growth is not 
everywhere the same. If growth is conceived to be due to certain 
common factors, e.g., certain hormones, it is apparent that the effects 
of such factors are differential; they do not influence all physical features 
alike. There is, in other words, a considerable specialization in the 
development of particular physical traits despite the fact that the 
association among them is positive rather than negative. Indeed 
Table III looks not unlike a chart of intercorrelations among specific 
mental tests. As in studying mental tests, it may therefore be advis- 
able to seek for some general or comprehensive measure of physical 
fitness or maturity to serve as a criterion for further analysis. The 
general measure in this case is the degree of physical fitness, stamina, 
maturity, as estimated by the teachers on the basis of a school year of 
daily observation. 





1 Howell’s “Textbook of Physiology,” Fifth edition, p. 586.
The degree to which the physical traits are correlated with physical
vigor, stamina and health, as estimated by the teachers is disclosed in 
Table IV. 


TABLE IV. CORRELATIONS OF PHYSICAL VIGOR, STAMINA AND HEALTH WITH
PHYSICAL TRAITS AS ESTIMATED BY TEACHERS

                          Area of  Ratio of                   Chest    Lung     Strength Nutri-   Heart
                          ossifi-  ossifi-  Height   Weight   girth    capacity of grip  tion     rate
                          cation   cation
r, physical vigor .......  .15      .15      .18      .25      .19      .22      .31      .37      .02


The correlations are very low, ranging from 0.37 with nutrition to 
0.15 with the measures of the wrist bones. Excluding heart rate 
which is zero, the average of the correlations with physical vigor is 
0.227. How this result is to be interpreted cannot be decided easily. 
It may portray the actual situation; it may be the result of a conceiv- 
able inability of teachers to judge these traits. 

A partial test of these two possibilities is afforded by the technique 
of multiple correlation, a procedure which makes it possible to appraise 
the correlation between the criterion (in this case physical efficiency 
estimated by the teachers) and two or more, or all of the physical 
measures combined in a manner which gives each an optimum weight 
or a maximum influence. 
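
For two measures the multiple correlation has a simple closed form, which gives a rough check on the procedure. The sketch below uses the standard two-predictor formula rather than Toops' method, and feeds it the averaged coefficients from Tables III and IV (nutrition 0.37 and grip 0.31 with estimated physical vigor, 0.14 with each other); the function name is hypothetical.

```python
import math


def multiple_r_two_predictors(r_c1, r_c2, r_12):
    """Multiple correlation of a criterion with two optimally weighted
    predictors, given the three pairwise correlations."""
    r_squared = (r_c1 ** 2 + r_c2 ** 2 - 2.0 * r_c1 * r_c2 * r_12) / (1.0 - r_12 ** 2)
    return math.sqrt(r_squared)


# Nutrition and strength of grip against estimated physical vigor.
R = multiple_r_two_predictors(0.37, 0.31, 0.14)
print(round(R, 2))  # 0.45
```

Two measures that each correlate modestly with the criterion but hardly at all with each other add almost independent information, so the combined coefficient rises well above either one alone.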

The multiple correlations (R’s) obtained by the use of Toops’ 
convenient method are shown in Table V. 

TABLE V. MULTIPLE CORRELATIONS BETWEEN THE CRITERION (PHYSICAL
STAMINA AND EFFICIENCY ESTIMATED BY TEACHERS) AND THE SEVERAL
PHYSICAL TRAITS

R physical efficiency with (nutrition and grip combined) = 0.45
R physical efficiency with (nutrition, grip and weight) = 0.515
R physical efficiency with (nutrition, grip, weight and lung capacity) = 0.5152
R physical efficiency with (nutrition, grip, weight, lung capacity and
    chest girth) = 0.536
R physical efficiency with (nutrition, grip, weight, lung capacity, chest
    girth and height) = 0.61
R physical efficiency with (nutrition, grip, weight, lung capacity, chest
    girth, height, and ossified ratio) = 0.612


It will be recalled that the highest single correlation (that of
nutrition with estimated physical efficiency) was 0.37. With the
addition of other traits, the multiple R steadily increases until, with all
seven measures properly combined, the correlation reaches 0.61, a
very marked increase. This fact makes two interpretations likely:
(1) The teachers really know what they are estimating and have con- 
siderable ability to appraise what they have in mind, and (2) general 
physical stamina, vigor and efficiency are not well indicated—con- 
trary to frequent assertions—by any one physical trait. The degree 
of ossification of the carpal bones, measures of nutrition, weight, 
height, grip, or any other single physical trait is but mildly indicative 
of the general physical efficiency status, but when all of these are 
properly combined, they yield a significant measure. Were all of these 
measures more perfect, and especially, were the teachers’ judgment not 
fallible, the correlations would doubtless be higher than those obtained. 

In particular, Baldwin’s choice of height, weight, and the area of 
ossification of the wrist bones appears to be by no means an optimum 
choice of three. These three are, indeed, as Baldwin points out, 
closely correlated with each other, but this is exactly the trouble. 
They sample too largely size of the skeleton, too slightly other impor- 
tant features of physical status. Among our pupils a better team, for 
purposes of gauging physical fitness, stamina and maturity would be 
nutrition, grip, and weight, or nutrition, lung capacity, and height. 

Apparently, in appraising physical traits we have a very close, 
perhaps a perfect, counterpart of the situation in measuring mental 
abilities. In both fields, different single measures correlate posi- 
tively, but far from perfectly with each other; in neither field does it 
seem likely that any one trait will satisfactorily represent general 
abilities. To secure a reliable index either of general mental ability or 
general physical efficiency, it is necessary to sample and combine many 
different representative single measurements. 


THE DEGREE TO WHICH PHYSICAL MEASUREMENTS PORTRAY MATURITY:
MENTAL, EDUCATIONAL, EMOTIONAL AND SOCIAL


We have observed that no single physical trait yields a high correlation
with physical vigor as estimated and that probably valid distinctions
exist among the several types of maturity. In this section we
shall attempt to ascertain to what degree physical measurements
(ossification of the wrist bones, height, etc.) yield a practically serviceable
index of any other form of maturity. The variables are: (1) The
eight physical measures, (2) mental age measured by the Stanford-Binet,
(3) mental maturity, (4) social maturity, (5) educational maturity,
and (6) emotional maturity, all estimated by teachers, together with
(7) a sum of ratings (3), (4), (5) and (6), and (8) estimated physical
vigor which is repeated from Table IV to provide a basis of comparison.
The correlations appear in Table VI.


TABLE VI. CORRELATIONS OF PHYSICAL TRAITS WITH VARIOUS MATURITIES.
EACH r IS THE AVERAGE OF FOUR

                      2       3         4         5         6         7        8         9
                      Binet   Mental    Social    Educa-    Emo-      Sum of   Physical  Mean
                      MA      maturity  maturity  tional    tional    ratings  vigor
                                                  maturity  maturity
Ossified area ....... .07     .05       .13       .12       .13       .11      .15       .11
Ossified ratio ...... .11     .15       .24       .15       .20       .22      .15       .17
Height .............. .06     .11       .11       .07       .15       .09      .18       .11
Weight .............. .10     .13       .09       [illeg.]  .17       .19      .25       .16
Chest girth ......... .09     .09       .15       .14       .17       .12      .19       .14
Lung capacity ....... .09     .09       .12       .06       .11       .10      .22       .11
Strength of grip .... .06     .07       .08       .15       .05       .20      .31       .13
Nutrition ........... .13     .15       .18       .17       .15       .27      .37       .20
Mean ................ .09     .11       .14       .13       .14       .16      .23


No mean correlation (at the foot of the table) is as high as that 
between the physical traits and estimated physical vigor, which, as 
previously stated, is very low. It is apparent that no physical trait
alone correlates in any practically significant degree with mental age, 
with mental, social, scholastic or emotional maturity or with these 
ratings, equally weighted, in the ‘‘sum of the ratings.” Although 
small, the coefficients are invariably positive. They provide new 
evidence of the fact, frequently observed during the past 25 years, 
that desirable traits of all types tend to go together, that the possession 
of some good trait implies slightly the possession of other good traits 
rather than the opposite. All of the correlations in Table VI appear 
to express this tendency and no more; the coefficients are about as 
low as they could be and still show this general disposition to positive 
association. They are, however, so low as to be of no utility in predictions
of individual status for practical purposes.

No one physical trait appears to be sensibly better than another
as a symptom of mental age or of any type of maturity here measured,
since all of the correlations are so low. Yet, since several physical
traits combined were found to give a much higher correlation with
physical vigor, etc., it is conceivable that a team
of physical tests may give a substantial correlation with MA or some
type of maturity. By means of the technique of multiple correlation
the facts may be disclosed.
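
As a minimal sketch of the technique named here, the multiple correlation R of a criterion with a team of predictors can be computed from simple correlations alone, using the standard relation R² = Σ βᵢ rᵢ, where the β are the standardized regression weights. The two correlations with mental age are taken from Table VI; the intercorrelation of 0.20 between the two predictors is a hypothetical value supplied only for illustration, and the function names are likewise illustrative rather than anything used in the study.

```python
import math

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def multiple_r(r_xy, r_xx):
    """Multiple correlation of a criterion with a team of predictors.

    r_xy: simple correlation of each predictor with the criterion.
    r_xx: intercorrelations among the predictors (unit diagonal).
    """
    beta = solve_linear(r_xx, r_xy)  # standardized regression weights
    r_squared = sum(b * r for b, r in zip(beta, r_xy))
    return math.sqrt(r_squared)

# Correlations with mental age from Table VI: nutrition .13, ossified ratio .11.
r_with_ma = [0.13, 0.11]
# HYPOTHETICAL intercorrelation between the two predictors (not from the study).
assumed_r12 = 0.20
r_between = [[1.0, assumed_r12], [assumed_r12, 1.0]]

team_r = multiple_r(r_with_ma, r_between)
print(round(team_r, 3))
```

Under that assumed intercorrelation the team of two yields an R only slightly above the better single predictor, which is the general pattern when the individual correlations are all small.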

The multiple correlations of physical traits with mental age, as one
trait after another is added to the team, are shown in Table VII. The
index of nutrition alone gave an r of 0.13, whereas a composite of seven
physical traits gives a coefficient of 0.21, a figure indicating but a slender
relationship.


TABLE VII. THE CORRELATIONS BETWEEN MENTAL AGE AND PHYSICAL TRAITS
TEAMED TOGETHER BY MULTIPLE CORRELATION


MA with nutrition........................................................ .130
MA with (nutrition and ossified ratio)................................... .150
MA with (nutrition and ossified ratio and weight)........................ .170
MA with (nutrition and ossified ratio and weight and chest girth)........ .185
MA with (nutrition and ossified ratio and weight and chest girth and lung
    capacity)............................................................ .200
MA with (nutrition and ossified ratio and weight and chest girth and lung
    capacity and height)................................................. .211
MA with (nutrition and ossified ratio and weight and chest girth and lung
    capacity and height and grip)........................................ .212


TABLE VIII. THE CORRELATIONS BETWEEN SOCIAL MATURITY AND PHYSICAL
TRAITS TEAMED TOGETHER BY MULTIPLE CORRELATION


Social maturity with ossified ratio...................................... .24
Social maturity with (ossified ratio and nutrition)...................... .266
Social maturity with (ossified ratio and nutrition and weight)........... .349
Social maturity with (ossified ratio and nutrition and weight and chest
    girth)............................................................... .366
Social maturity with (ossified ratio and nutrition and weight and chest
    girth and lung capacity)............................................. .369
Social maturity with (ossified ratio and nutrition and weight and chest
    girth and lung capacity and height).................................. .372
Social maturity with (ossified ratio and nutrition and weight and chest
    girth and lung capacity and height and grip)......................... .374


Since the correlations of physical traits with the several types of 
maturity, mental, social, educational and emotional, are about the 
same in the average, the multiple correlations for one will disclose 
essentially the facts for all. In Table VIII the multiple correlations 
of the physical measures with social maturity are given. 

The highest obtainable correlation with social maturity, produced 
by combining seven physical traits, is 0.374. This figure is appreciably
higher than the corresponding multiple correlation (0.21) with
mental age, objectively measured, but it is by no means a “high”
correlation. If these seven physical traits represent the basis of 
physiological maturity, it is clear that it is by no means synonymous 
with social maturity. At least, social maturity, as here estimated, 
cannot be usefully predicted by any combination of such physical 
characters as were used. The same must be said of mental, educational 
and emotional maturity which would correlate about the same or lower 
with the physical traits. Baldwin’s statements, as already given, find 
no support in our data. 

Baldwin appears to use, on certain occasions, the stage of the onset
of pubescence as a criterion of general maturity,¹ although his statements
are not very exact. He points out the correlation between
height and age of pubescence, a correlation that is not high. Even if the
associations were very close, the determination of the exact significance 
of the glandular functions of pubescence awaits discovery. To treat 
this function as evidence of final maturity in general, or even of social 
or emotional maturity, is to assume what is unproved. The onset of 
these sexual functions may be of but limited significance. 


INTERCORRELATIONS OF VARIOUS MATURITIES AS ESTIMATED 


In Table IX are given the average intercorrelations of the various 
types of maturity which were estimated by teachers. 


TABLE IX. CORRELATIONS OF TRAITS AS ESTIMATED BY THE TEACHERS. EACH
COEFFICIENT THE AVERAGE OF FOUR


                        Physical   Mental     Social     Educational  Emotional
                        vigor      maturity   maturity   maturity     maturity

Physical vigor          ...        .43        .40        .36          .42
Mental maturity         .43        ...        .69        .70          .60
Social maturity         .40        .69        ...        .43          .66
Educational maturity    .36        .70        .43        ...          .70
Emotional maturity      .42        .60        .66        .70          ...

Mean                    .40        .61        .54        .54          .60


1 “The Physical Growth of Children,” p. 188ff., and the Twenty-third Yearbook,
p. 38.

Mental, social, educational and emotional maturity are positively
associated; the average of the intercorrelations among them is 0.539. 
It is probable that the defect so uniformly found in such estimates, 
namely, that certain traits are repeatedly appraised in each (Thorn- 
dike’s “halo” effect) thus increasing the correlations among them 
above what actually exist in nature, is to be encountered in these 
data also; few judges, if any, are able to avoid entirely this constant 
error. That is, since certain traits influence the judgments in the case 
of all of these maturities, the intercorrelations are probably higher 
than would be obtained by perfectly objective measurements. 
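
The average just quoted can be checked against Table IX by simple arithmetic. The sketch below, offered only as a reading of the printed coefficients and not as the author's own computation, averages the tabled r's over every pair of traits. Taken as printed, the ten coefficients among all five traits (physical vigor included) average 0.539, while the six among the four maturities alone average about 0.63, so the quoted figure appears to include physical vigor. The trait keys and the function name are illustrative.

```python
from itertools import combinations

# Intercorrelations as printed in Table IX (each the average of four).
table_ix = {
    ("physical", "mental"): 0.43,
    ("physical", "social"): 0.40,
    ("physical", "educational"): 0.36,
    ("physical", "emotional"): 0.42,
    ("mental", "social"): 0.69,
    ("mental", "educational"): 0.70,
    ("mental", "emotional"): 0.60,
    ("social", "educational"): 0.43,
    ("social", "emotional"): 0.66,
    ("educational", "emotional"): 0.70,
}

def mean_intercorrelation(traits):
    """Average the tabled r over every pair drawn from `traits`."""
    pairs = list(combinations(traits, 2))
    return sum(table_ix[pair] for pair in pairs) / len(pairs)

all_five = ["physical", "mental", "social", "educational", "emotional"]
four_maturities = all_five[1:]  # drop physical vigor

mean_all = mean_intercorrelation(all_five)
mean_maturities = mean_intercorrelation(four_maturities)
print(round(mean_all, 3), round(mean_maturities, 2))
```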

Making moderate allowances for this probably constant error, it 
appears that we are here dealing not with a general maturity, every- 
where one and the same, but with a plurality of growths, which, while 
correlated positively in general, nevertheless have a noticeable inde- 
pendence. Children do not mature at a uniform rate in each and 
every trait; they typically mature at somewhat different rates in par- 
ticular traits. The notion that certain hormones influence or control 
growth in general is likely to lead us into conceptions quite detrimental 
to educational diagnosis. General maturity is, like general intelli- 
gence, general scholastic status, or general athletic ability, merely an 
arithmetic average of many traits which are distributed above and 
below the central level. Growth is not unified; it is specialized. 


CORRELATIONS OF EDUCATIONAL ACHIEVEMENT WITH PHYSICAL
TESTS AND ESTIMATED HEALTH, VIGOR AND STAMINA


In this section, we shall attempt to ascertain the degree to which
physical status, as determined by the physical tests and by the health, vigor
and stamina estimated by teachers, contributes specifically to educational
achievement. In making the necessary computations we have used
only the results from the two Grade IV groups since only for them have 
been secured adequate measures of attainment. Achievement is 
based upon a series of educational tests; it is strictly scholastic in 
character. 

The correlations, which are the means for both sexes, between the 
physical traits and both MA and achievement for the Grade IV 
groups are given in Table X. 

While studying Table X the reader should recall that the measures 
upon which the MA’s are based were given at the beginning of the year 
(or earlier), at the time the physical measurements were taken, except
for the X-rays of the wrists, which were obtained at midyear, whereas
educational achievement was measured at the end of the year. No 
physical measurements, however, give a high correlation with either 
MA or achievement, nor are the correlations with physical vigor very 
substantial. Between MA and educational achievement, on the other 
hand, is a correlation of 0.595, which is in the vicinity of correlations 


TABLE X. CORRELATIONS OF PHYSICAL TRAITS WITH MA AND EDUCATIONAL
ACHIEVEMENT FOR GRADE IV: EACH COEFFICIENT IS THE MEAN OF
TWO, ONE FROM EACH SEX


                               Mental age    Educational achievement

Ossified area..............    .04           .02
Ossified ratio.............    .12           .16
Height.....................    .02           .01
Weight.....................    .02           .07
Chest girth................    .01           .03
Lung capacity..............    .03           .00
Grip.......................    .02           .15
Nutrition..................    .08           .15
Estimated physical vigor...    .255          .38
Mental age.................    ...           .595


commonly found under such conditions. It represents the prediction 
of achievement at least a year in advance of the test. To foretell 
educational achievement, then, there can be no doubt that, among the 
variables in Table X, the Binet Mental Age is first choice. To ascertain
the usefulness of the others, a method should be adopted which dis- 
closes their specific or residual contribution over and above the MA. 

The problem is as follows: The obtained correlation between MA and
EA in the case of Grade IV, where EA was most reliably measured,
is 0.595. To what extent will the correlation with EA be increased if
we add to MA one or all of the physical measures? That is, by combining
MA in the best possible manner with the physical tests, how
much better a prediction of achievement is obtained than when MA is
used alone? The answer is given in Table XI.


TABLE XI. PREDICTIONS OF EDUCATIONAL ACHIEVEMENT BY MA ALONE AND
BY MA COMBINED BY MULTIPLE CORRELATION WITH PHYSICAL
MEASUREMENTS. BASED ON DATA OF GRADE IV ONLY


r educational A with MA alone.......................................... 0.595
R educational A with (MA + ossified ratio)............................. 0.6018
R educational A with (MA + ossified ratio + height).................... 0.6018
R educational A with (MA + ossified ratio + height + weight)........... 0.6092
R educational A with (MA + ossified ratio + height + weight + chest
    girth)............................................................. 0.6094
R educational A with (MA + ossified ratio + height + weight + chest
    girth + lung capacity)............................................. 0.6094
R educational A with (MA + ossified ratio + height + weight + chest
    girth + lung capacity + grip)...................................... 0.6128
R educational A with (MA + ossified ratio + height + weight + chest
    girth + lung capacity + grip + nutrition).......................... 0.6258

If all of these traits properly weighted and combined may be taken
as a fair indication of physical status, it is then apparent that among
our pupils the influence on achievement is small. If the rôle of mental
ability in producing achievement is 0.595, and the rôle of MA combined
with seven physical traits (including such a one as the widely heralded
ossification of the wrist bones) is but 0.625, the implication is
unequivocal: the influence of physical status, as measured by these tests, on
achievement among children as we find them in a first-rate school is
real but on the average slight.

It was observed earlier that the correlation between all of the 
objective physical measurements and the teachers’ estimates of 
physical vigor, stamina, etc. was not perfect; in fact, it was R = 0.612. 
It will be advisable, therefore, to see to what extent addition of the 
teachers’ appraisements of physical efficiency may affect the predictions
of achievement yielded by MA alone. The results are given
below, in Table XII.


TABLE XII


r educational A with MA alone.......................................... 0.595
R educational A with (MA + estimated physical efficiency).............. 0.6529


Physical vigor, stamina, efficiency as estimated by the teachers, 
when combined in properly weighted form with mental age give a 
somewhat higher correlation with achievement than does mental age 
alone. The combination of MA and teachers’ estimates of physical 
efficiency is better than the combination of MA with seven physical 
measurements; the R for the former is 0.6529 as compared to an R
of 0.625 for the latter. 

It is possible, finally, to combine with MA both the estimated 
physical efficiency and the team of eight physical measurements. The 
resulting multiple correlation with educational achievement is 0.6538, 
barely perceptibly higher than that yielded by MA and estimated 
physical vigor. 

This figure, R = 0.6538, is the highest correlation obtainable by the 
addition of all physical traits, estimated and measured. It is sensibly 
higher than 0.595, the correlation between educational achievement 
and mental age alone. It indicates that among the pupils in Horace
Mann School, under the conditions of the particular year concerned,
physical superiority, particularly vigor and stamina, exerts a positive
but nevertheless slight influence upon achievement. In comparison,
intellectual aptitude is by far the more important determining factor.

The correlation of mental age plus seven physical measures plus 
estimated physical vigor with achievement falls far short of unity. 
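
One way to gauge how slight these gains are is to square each coefficient, since the square of a correlation is conventionally read as the share of criterion variance accounted for. The sketch below applies that reading to the coefficients reported in Tables XI and XII and in the text. It is an illustrative interpretation, not one the author uses, and the labels and function name are invented for the example.

```python
# Correlations with educational achievement reported in Tables XI and XII
# and in the accompanying text.
reported = {
    "MA alone": 0.595,
    "MA + seven physical measures": 0.6258,
    "MA + estimated physical efficiency": 0.6529,
    "MA + physical measures + estimated efficiency": 0.6538,
}

def variance_share(r):
    """Proportion of criterion variance accounted for by a correlation r."""
    return r * r

baseline = variance_share(reported["MA alone"])

# Gain in variance accounted for, relative to mental age used alone.
gains = {name: variance_share(r) - baseline for name, r in reported.items()}

for name, r in reported.items():
    print(f"{name}: R^2 = {variance_share(r):.3f}, gain over MA = {gains[name]:.3f}")
```

On this reading mental age alone accounts for about 35 per cent of the variance in achievement, and every physical measure and estimate together adds roughly 7 points more.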

We may now proceed to ascertain whether other traits, particularly 
social and emotional maturities, evaluated in this study appear to 
influence accomplishment. 


THE INFLUENCE OF SOCIAL AND EMOTIONAL MATURITY UPON EDUCATIONAL
ACHIEVEMENT


The significance and reliability of these two estimates of maturity 
have been already given. For the two Grade IV groups, the mean 
correlations obtained are given in the accompanying Table XIII. 


TABLE XIII. MEAN OF CORRELATIONS FOR THE TWO GRADE IV GROUPS


                                                  SOCIAL      EMOTIONAL
                                                  MATURITY    MATURITY
Mental age with.................................  0.20        0.165
Educational achievement with....................  0.19        0.20
Social maturity with............................              0.645


Social and emotional maturity are substantially correlated but 
neither is closely associated with the Binet Mental Age. This fact has 
been frequently stated on empirical grounds; that is, among those of a 
given age and intelligence quotient, quite wide variations in social 
and emotional, as well as physical, maturity are usually found.
By many writers it is often implied if not asserted that social and 
emotional maturity enter potently into the determination of scho- 
lastic achievement. While this has been implied, it has never, to our 
knowledge, been substantiated by experimental evidence. 

Between achievement and either social or emotional maturity the 
correlation is approximately 0.20 indicating that the latter exert no 
great influence on the former. The degree to which either or both 
together, when teamed with mental age, will increase the prediction of 
achievement, is disclosed by the multiple correlations in Table XIV. 


TABLE XIV. CORRELATIONS OF ACHIEVEMENT WITH MA WITHOUT AND WITH
THE ADDITION OF SOCIAL AND EMOTIONAL MATURITY


Simple r educational achievement with MA............................... 0.595
Multiple R educational achievement with (MA + social maturity)......... 0.603
Multiple R educational achievement with (MA + emotional maturity)...... 0.604
Multiple R educational achievement with (MA + social maturity +
    emotional maturity)................................................ 0.6042


The effect of adding either social or emotional maturity, or both, to
MA by multiple correlation is to increase the correlation with achieve- 
ment in a quite negligible degree. This is tantamount to saying that to 
combine social or emotional maturity with MA by rough and ready 
methods, by any manner save an optimum weighting of each as deter- 
mined by the regression coefficients, would result in most instances in 
no increase, quite possibly in a decrease, of the predictive value of the 
team below that of MA alone. 
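
The point about rough and ready weighting can be illustrated with the standard formula for the correlation between a criterion and a weighted sum of two standardized scores. The sketch below uses the mean Grade IV coefficients: 0.595 and 0.19 for MA and social maturity with achievement, and 0.20 for MA with social maturity, on the assumption that the first row of Table XIII gives the correlations with mental age. Function names are illustrative. Because the inputs are rounded means over two groups, the optimum computed here (about 0.599) need not reproduce the tabled 0.603 exactly.

```python
import math

def composite_r(weights, r_with_criterion, r_between):
    """Correlation of a weighted sum of two standardized scores with a criterion."""
    w1, w2 = weights
    r1, r2 = r_with_criterion
    numerator = w1 * r1 + w2 * r2
    # Standard deviation of the weighted sum of two unit-variance scores.
    denominator = math.sqrt(w1 * w1 + w2 * w2 + 2 * w1 * w2 * r_between)
    return numerator / denominator

def optimal_r(r_with_criterion, r_between):
    """Multiple R for two predictors: the best that any weighting can do."""
    r1, r2 = r_with_criterion
    r_squared = (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r_between) / (1 - r_between ** 2)
    return math.sqrt(r_squared)

# MA and social maturity with achievement; MA with social maturity.
r_criterion = (0.595, 0.19)
r_ma_social = 0.20  # ASSUMED to be the MA row of Table XIII

equal_weight = composite_r((1.0, 1.0), r_criterion, r_ma_social)
ma_only = composite_r((1.0, 0.0), r_criterion, r_ma_social)
best = optimal_r(r_criterion, r_ma_social)
print(round(equal_weight, 3), round(ma_only, 3), round(best, 3))
```

With equal weights the composite correlates about 0.51 with achievement, well below the 0.595 given by MA alone, whereas the optimum weighting raises the figure only to about 0.60.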

Between physical efficiency as estimated and social and emotional 
maturity a fair correlation exists. It has been found that the addition 
of the former to MA increases somewhat the correlation with achievement.
How much this increase is and how little further improvement
is to be obtained by adding social and emotional maturity is
disclosed in Table XV.


TABLE XV. THE CORRELATIONS WITH ACHIEVEMENT PRODUCED BY ADDING
TO MA, PHYSICAL EFFICIENCY, EMOTIONAL AND SOCIAL MATURITY


R educational achievement with (MA + physical efficiency).............. .6529
R educational achievement with (MA + physical efficiency + emotional
    maturity).......................................................... .6531
R educational achievement with (MA + physical efficiency + emotional
    maturity + social maturity)........................................ .6533


The increase in correlation brought about by both social and 
emotional maturity is but 0.0004, a quantity quite negligible.

Physical fitness, then, appears to exert a greater specific influence
(i.e., over and above the correlation with MA) upon achievement than
does either social or emotional maturity or both combined. Both
combined add practically nothing of value either to a team of MA and
physical fitness or to MA alone for purposes of predicting scholastic
achievement.


SUMMARY AND DISCUSSION OF RESULTS: CERTAIN POSSIBLE LIMITATIONS
OF THE DATA


Before attempting general interpretations of results, certain con- 
ceivable limitations of the data must be considered. 

First, the number of pupils utilized is not enormous. It would, of 
course, have been better to have had more. We believe, however, 
that this limitation is not serious. For this statement the reason is, 
primarily, that each correlation is considered in relation, not to those 
which may be obtained in larger unselected groups, but in comparison 
with others obtained in the same group. The groups were large enough
to yield representative coefficients between MA and achievement and
between various physical measurements, and the probability is that the
others are fairly representative.

Second, the groups are select; they represent in the main an upper 
class from superior homes. This is true; indeed, the main problem 
before us was the determination of the practical value of various 
measurements, other than intelligence, for classifying and managing 
these particular groups. There is some evidence, not conclusive as yet, 
that physical traits show a closer parallel with intellectual aptitudes 
and achievements among the dull than among the bright.¹ But in
any case, the results will probably not differ greatly from those here 
obtained. 

Our problem is the practical one of ascertaining the usefulness of 
the several tests and estimates studied under the actual schoolroom 
conditions. Finding the pupils in the actual classes and facing the 
task of deciding what should be done for the following year, the 
problem must be taken up in its concrete setting. Each variable 
should be tested on its practical value in assisting in the proper classification
of pupils. Correlations too low to be of value here are rather
certain to be too low to have practical value in other school-class
situations.

Third, the estimates of social, emotional and other types of 
maturity may not agree closely with the facts. This is also true, but 
it may be urged that since the teachers by combined judgments could 
estimate intelligence, physical fitness and scholastic maturity fairly 
well, they could also gauge other maturities fairly well. The relia- 
bilities of the appraisals, at least, were about the same for all traits 
judged. It is not improbable that objective measurements of social, 
emotional and other maturities, some day to be obtained perhaps, 
may be more useful for the purposes for which estimates were here 
tried. We believe, however, that our estimates of these traits are 
about as good as any and better than most that may now be secured 
under school conditions. 


1 Murdock, K. and Sullivan, L. R.: appendix, but compare with Burt, Cyril:
“The Distribution and Relations of Educational Abilities.” London: P. S.
King, 1917, p. 85ff.


THE SIGNIFICANCE OF PHYSICAL MEASUREMENTS


Our data force us to an apparent disagreement with Baldwin,
Rotch, Woodrow, and others concerning the general significance, the
predictive value, of physical traits. These writers have suggested
that certain traits, notably the X-ray of the wrist bones, are indicative
of general anatomical, physiological or physical maturity or of “physical
fitness” or of “maturity” in general. Thus Baldwin recently
has written that “there is a high correlation of the development of
carpal bones . . . with growth in stature and with other criteria of
maturity.” Quotations from Rotch and Woodrow are given above.

Our objections are: 

First, that the correlations among physical measures, while always
positive and often high, are also often low. No single measure is an
adequate index of the status of all others. The area of ossification
of the wrist bones, for example, while correlating about 0.60 with height
and weight, yields lower coefficients, 0.43, 0.31, 0.26 and 0.25 respectively,
with chest girth, lung capacity, strength of grip and an index of
nutrition. To secure a general physical measure, it is necessary
to combine several physical measurements.

Second, no single physical measure yields a high correlation with
the estimates of physical vigor, stamina, maturity and fitness made
by several teachers after a school year of observation of their pupils.
The ossification of the carpal bones, for example, yields an average
correlation of 0.15 with these estimates; the highest correlation, 0.37,
is given by nutrition. When all seven different physical measures are
combined by the technique of multiple correlation, the correlation
with physical fitness is 0.61. Specialization in the development of
physical features is thus indicated again; no single touchstone of
physical fitness appears. It may be objected, of course, that the
teachers’ judgments are inadequate. The fact that the correlations
run up steadily as more measures of physical traits are added to the
team is some evidence to the contrary, and other evidence will be
observed later, particularly the fact that the teachers’
estimate of physical fitness adds to the correlation with educational
achievement over and above the influence of MA, not only more than
any single physical trait but more than all of them combined.

Third, no physical trait is an adequate index (as Baldwin and other
writers imply) of such types of maturity as mental, scholastic, social,
emotional, or “general” maturity. For example, the X-ray of the
carpal bones correlates from 0.07 to 0.24 with the several ratings.
No other single physical trait varies much from the average correlations
for all physical measures, which are as follows: with mental age,
0.09; with estimated mental maturity, 0.11; with social maturity, 0.14;
with educational maturity, 0.13; with emotional maturity, 0.14;
with the sum of all maturity ratings (i.e., “general maturity”), 0.16.
Clearly there has been a great overestimation of the general significance
of many of these physical measurements. With no measure of
maturity does a single physical trait give a correlation practically
serviceable for individual prediction. Furthermore, when all of the
physical measures here used are teamed together by multiple correlation,
the resulting multiple coefficients are in no case high except with estimated
physical fitness. All traits, combined with weightings to yield
the highest possible correlation, yield with mental age 0.21, and with
social maturity 0.37, which was the highest obtained.

1 Baldwin, B. T.: The Twenty-third Yearbook of the National Society for
Study of Education. Part I, p. 39. Italics ours.

Mere physical status or maturity, then, however adequately
measured, apparently by no means gauges mental, educational,
social or emotional maturity satisfactorily. Classifications of pupils
alike in general physical status will not result in a satisfactory
classification on any other basis. They will be alike physically, provided
enough physical measurements are made, but they may be quite unlike
in mental age, in mental, social, scholastic and emotional maturity.
The implications of Baldwin’s statements that “Physiological age as
indicated by weight and height” is “directly correlated with stages of
mental maturation;” indicates “different attitudes, different types of
emotions, different interests;” that its “relationship [to] social age
. . . needs only to be mentioned to be evident,” are not in harmony
with the facts as we have found them in Horace Mann School. Mental,
educational, social, and emotional maturity, as the consensus of opinion
among our teachers portrays them, are not represented at all well
by the physical traits.


MATURITY NOT UNITARY BUT SPECIALIZED


Maturities, as we have appraised them, are not everywhere one
and the same. Growth is specialized and has many phases. Insofar
as physical status, mental age, educational age, and estimated mental,
physical, social, scholastic and emotional maturities as here secured
represent phases or types of growth, the conclusion apparent is that
growth is not everywhere uniform and single but varied and, in different
degrees, independent. In our groups, there is little association
between mental and physical growth, or between social or emotional
maturity and educational age. Among some of the estimated maturities,
the correlations are relatively high, due in part to the familiar
tendency of human judges to fall into errors (the “halo” influence)
which produce correlations that are spuriously high. When reasonable
allowances for these influences are made, the disparities among the
various maturities or growths which appear to exist in nature are
strongly suggested.

These statements should not be taken to mean that the growths are 
quite or even equally independent and unrelated. All are positively 
associated and at the least we have adduced new evidence that between 
any two desirable traits the correlation is positive. The correlation 
between mental age and the composite of physical traits is, to our 
mind, an expression of this fact and little if any more. The central 
tendency of other findings since 1893, as we see it, has been in this 
direction.¹ This would mean, of course, that a group of bright children
of a given age, would be physically superior on the average to a group 
of average or dull children, or that a group of physically superior 
children would show a higher average mentality than a group of aver- 
age or inferior physique. But for individual diagnosis, such a corre- 
lation would be practically useless. It is probable that the associations 
among some of the maturities, such as social and emotional, are more 
close—how close, our data will scarcely disclose; but at the highest the 
correlation will, we believe, be nearer this minimum positive associa- 
tion than unity. 
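
Why correlations of this size serve group comparison but not individual diagnosis can be made concrete with the coefficient of alienation, sqrt(1 - r²), which gives the error of a prediction made from r as a fraction of the error made by assigning every pupil the group mean. The sketch below applies it to three coefficients reported in this study. The statistic is a standard one introduced here for illustration, not one the author reports, and the labels and function name are invented for the example.

```python
import math

def alienation(r):
    """Coefficient of alienation: sqrt(1 - r^2).

    The standard error of a prediction made from a correlation r is this
    fraction of the error made by guessing the group mean for everyone.
    """
    return math.sqrt(1.0 - r * r)

# Coefficients reported in the text.
reported = {
    "physical team with mental age": 0.21,
    "physical team with social maturity": 0.374,
    "MA with achievement": 0.595,
}

for name, r in reported.items():
    k = alienation(r)
    print(f"{name}: r = {r}, error remaining = {k:.3f}, reduction = {1 - k:.1%}")
```

A correlation of 0.21 reduces the error of individual prediction by only about 2 per cent, and even 0.595 by only about 20 per cent.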

Classification, then, in terms of maturity offers this difficulty: 
The several varieties, while intercorrelated variously, are nevertheless 
far from identical. Arrange the group according to one and there is 
left a wide range in others. 

The problem of classifying pupils can scarcely be adequately discussed
until the objectives have been determined. Is it the object to
group together those that are alike in mental age, mental maturity,
physical development, social, scholastic or emotional maturity, or
what? Since students of the subject will disagree on these matters,
it will be advisable to take up the problem from definite but different
points of view, attempting to ascertain what measures or estimates
will best enable us to gather together those most alike in some respect.
First, let us consider the problem from the point of view of predicting
educational achievement, on the assumption that our purpose, for the
moment, is to get together pupils whose prospects of advancement in
these respects are similar.

1 See summary in the Teachers College Record, 1924.


BEARING OF RESULTS ON PROBLEM OF CLASSIFICATION OF PUPILS 
FOR ACADEMIC INSTRUCTION 


The problem of classification of pupils, once mainly solved on the 
basis of demonstrated or estimated achievement, is now more difficult 
inasmuch as expectation of progress is the more prominent considera- 
tion. It is now considered not enough to group together children who 
are scholastically alike at the moment; those should be put together who 
will advance similarly. It is necessary, therefore, that means of pre- 
dicting probable achievement be secured. The degree to which the 
several variables will enable us to increase the accuracy of our pre- 
diction of achievement becomes the matter of prime import. 

Of all the tests or estimates used in this study, the Stanford-Binet 
Mental Age gave the best prediction of achievement. Although all 
of these tests were given at the beginning of the year, or at the begin- 
ning of the preceding year, the final attainments in May were foretold 
with an accuracy indicated by an uncorrected correlation of .595. 
Except for the teachers’ judgments of mental maturity (made at the 
end of the year and doubtless influenced by knowledge of test results) 
no other variable approaches this prediction. The physical measurements
given at the beginning of the year gave correlations as follows
(see Table XVI): height .01, weight .07, chest girth .03, lung capacity
.00, grip .15, nutrition .15. The measures of ossification of the
wrist bones, secured at midyear, gave for total area .02 and for percentage
of ossification .16. All of these correlations are individually so
small as to be of no particular value. Achievement is correlated with 
physical efficiency during the year, estimated at the end, .38; with 
social maturity .19; with emotional maturity .20. 

How great the specific influences of these variables may be is dis- 
closed by the multiple correlations obtained when one or more 
variables are added to MA to give the highest obtainable correlation 
with achievement as compared to the coefficient which MA alone yields. 
When all of the physical measures are teamed with MA the coefficient 
with achievement is 0.626, an almost insensible rise above the 0.595 
given by MA alone (see Table XVI). 

Baldwin’s statements on these points are: “Physiological age has a
direct bearing on pedagogical age, as many of our schools are beginning
to recognize. The larger and physiologically more mature child may
be able to do certain types of school work better, although of inferior
ability in specific traits which have been greatly emphasized by school
curricula. No child should be promoted or demoted without taking
into account his or her physiological age. Girls may be expected to
progress more rapidly than boys.”¹ For these statements, we find
no support in his data and in ours essentially a refutation insofar as 
the influence of the physical traits, specifically, upon scholastic 
achievement among Horace Mann pupils is concerned. 

Better than the team of physical measurements is the teacher’s 
estimate of physical fitness, vigor and stamina. Whereas the team 
of physical measurements increases the coefficient from 0.595, that 
given by MA alone, to 0.626, the addition of estimated physical fitness 
pushes the coefficient to 0.653. This figure is higher than those pro- 
duced by adding singly social or emotional maturity and the same as 
that given by adding the teachers’ estimate of mental maturity. 
When MA, estimated physical fitness and mental maturity are all com- 
bined, the correlation with achievement becomes 0.67. This was the 
highest multiple correlation secured except those which included the 
teachers’ estimate of educational maturity. This procedure is, how- 
ever, barely legitimate since the teachers, knowing fairly well their 
pupils’ accomplishment, may have been more or less influenced by it 
in their judgments. The result is thus, in a measure, a spurious self 
correlation. (Data summarized in Table XVI.) 


TABLE XVI
(A) Simple Correlations of Various Single Traits with Educational Achievement
Grade IV, Average r’s for Boys and Girls

Binet Mental Age ............... 0.595         Strength of grip ................. 0.15
Estimated mental maturity ...... [illegible]   Weight ........................... 0.07
Estimated physical efficiency .. 0.38          Chest girth ...................... 0.03
Estimated emotional maturity ... 0.20          Area of ossification ............. 0.02
Estimated social maturity ...... 0.19          Height ........................... 0.01
Ratio wrist bones ossified ..... 0.16          Lung capacity .................... 0.00
Nutrition index ................ 0.15          All physical traits (estimated) .. 0.24

(B) Multiple Correlations, Educational Achievement with Mental Age Plus
Some Other Trait

Educational achievement with (MA + social maturity) .................. 0.603
Educational achievement with (MA + emotional maturity) ............... 0.604
Educational achievement with (MA + seven physical traits) ............ 0.626
Educational achievement with (MA + mental maturity) .................. 0.6519
Educational achievement with (MA + physical efficiency) .............. 0.6529

(C) Multiple Correlations, Educational Achievement with MA Plus Several
Other Traits

Educational achievement with (MA + social maturity + emotional
  maturity) .......................................................... 0.6042
Educational achievement with (MA + physical efficiency + emotional
  maturity) .......................................................... 0.6531
Educational achievement with (MA + physical efficiency + social
  maturity + emotional maturity) ..................................... 0.6533
Educational achievement with (MA + physical efficiency + seven
  physical traits) ................................................... 0.6538
Educational achievement with (MA + physical efficiency + mental
  maturity) .......................................................... 0.6698
Educational achievement with (MA + mental maturity + scholastic
  maturity) .......................................................... 0.7826


1 “The Physical Growth of Children,” p. 197.


The main finding here is that the physical status, based on a team of 
tests carefully done by experts, appears to exert a barely perceptible— 
and a practically unimportant—influence on achievement of the sort 
measured by scholastic tests. Estimated social and emotional maturity 
show even less; mental maturity excluding what is common to the MA 
somewhat more than physical status and about the same amount 
specifically as estimated physical fitness. The best prediction would
give to any trait here studied, other than mental ability, a very slight
weight.


SHOULD THE OBJECT OF CLASSIFICATION BE INTELLECTUAL OR
SCHOLASTIC SIMILARITY ALONE? 


These statements are made on the basis of an assumption, namely, 
that the desirable objective is to group together pupils who are likely 
to attain similar degrees of achievement at the end of the school year. 
It may be urged, that even if a measurement of mental ability, slightly 
qualified by considerations of physical status and fitness and mental 
maturity, yields the best available means of prediction, it might still 
be desirable. While the physically inferior may, or even do, accom- 
plish essentially as much as those of similar mentality and superior in 
weight, size, and other physical traits, they should not, perhaps in 
the interest of health or physical growth, be required or allowed to do so.
This consideration lies outside the scope of this study; no data are
available on it. It is plainly a problem worthy of careful investigation.
A priori there is a defense for an opposing point of view, namely, that 
health and growth are both benefited, not by idleness or by work on
materials below the level of one’s powers, but by absorption in tasks 
which are adjusted as closely as may be to one’s ability. For this 
assumption there is, moreover, considerable evidence both from obser- 
vations of school children and from observations of adults by physi- 
cians and others. We have not as yet been presented with evidence 
that promotion or acceleration on the basis of mental ability alone does
any harm to health, happiness, growth or efficiency; nor have we seen 
any satisfactory evidence that grading mainly on the basis of physiolog- 
ical maturity does any good in these respects. 

Even if it is urged that equality in capacity for scholastic achieve- 
ment is not the only criterion for classification or promotion of pupils 
but that similarity in physiological maturity, in physical strength, 
athletic abilities, motor abilities, social aptitude, inclinations or matu- 
rity, emotional propensities and maturity, and other human traits, is 
also desirable, it becomes more or less essential that similarity in 
strictly scholastic possibilities be maintained; this problem can by no 
manoeuvering be brushed aside. The other criteria of classification, 
however, do merit consideration. 


HOW ARE PUPILS TO BE SELECTED FOR PHYSIOLOGICAL, SOCIAL AND
OTHER PHASES OF MATURITY?


First, we should consider ways and means of selecting pupils alike 
in physique, in social, moral, emotional and other traits. Our data, 
so far as they go, demonstrate this fact: We do not secure groups uni- 
form in the social, emotional, mental or scholastic traits by grouping 
them on a basis of physical measurements. Whether we would get 
pupils alike even in motor facility and aptitude, in athletic prowess, 
was not determined. In general, it was found that these several 
divisions of traits were far from perfectly correlated. Each must be 
measured by itself and not estimated from something else. Thus, 
social maturity is correlated with ossified area, 0.13; with ossified ratio, 
0.20; height, 0.15; weight, 0.17; with all physical traits combined, 
0.37; with MA 0.26; educational achievement, 0.19; and with estimated 
educational maturity, 0.43, mental 0.69, emotional 0.66, and physical 
fitness 0.66. When allowances are made for the constant error (halo 
effect) of judgments which produces correlations with social maturity, 
itself estimated, which are too high, it appears probable that no group- 
ing will give a compact distribution in all of these traits. 
No one, we trust, will accuse the writers of having said that
“physical measurements are of little value, barely worth the trouble.”
This is not implied. These measurements are highly to be desired 
for their own sake, for appraising physical growth, regulating diet, 
exercise, etc., for correcting defects, suggesting susceptibilities and 
other uses, but not for the purposes of classifying children intellectually,
scholastically, socially or emotionally, at least as these traits are
here defined.


EVIDENCE FOR THE PRINCIPLE OF MULTIPLE CLASSIFICATION 


Our data afford some basis for the suggestion that greater fluidity
and variety of classifications possess certain merits. The class organ- 
ized for work in the three R’s and other scholastic work need not be 
kept intact for athletic, social, mechanical, artistic and other pursuits. 
For each activity, an appropriate group may be sifted from several 
scholastic levels. Apparently only by some procedure of multiple
selection may groups of similar aptitudes or attainments be secured. 
Conceivably, such a procedure would have in many ways a wholesome 
broadening and democratic influence. Children would profit by 
association on equal terms in some respect, such as drawing or athletic 
ability, with those unlike in other respects. Membership not only in 
one group—as often happens for the several years of a grammar school 
career—but with many groups would perhaps tend to reduce clannish- 
ness, broaden friendship, enrich experience, and build a finer foundation 
for effective participation in the many-sided activities with all varieties
of people which life in a democracy demands.


1 We are attempting to arrange several studies of the personal characteristics 
of members of various spontaneous groupings, such as those for athletic, debating, 
literary, social, hiking, activities in and out of schools. That is, we shall seek 
to answer the question: What kinds of children with respect to size, strength, 
skill, sociability, aggressiveness, emotionality, intelligence, etc., tend to group 
together for different purposes? Such information would solve none of the 
present problems but it would suggest, perhaps, difficulties, ways and means, 
natural inclinations which should be taken into account. 
THE VARIATION AND SIGNIFICANCE OF INTELLIGENCE
QUOTIENTS OBTAINED FROM
GROUP TESTS


W. S. MILLER 
Professor of Educational Psychology, University of Minnesota 


Anyone who has used more than one group test doubtless has 
observed the variation in intelligence quotients obtained. Gates 
has published very interesting data showing the variation of intelli- 
gence quotients of elementary school pupils obtained from certain 
group tests. The fact that this variation exists is very apparent and 
needs no further demonstration or elaboration. The common practice 
of interpreting IQ’s on group tests in the same manner as IQ’s obtained 
on the Stanford Revision of the Binet-Simon Scale is misleading or 
even absurd. 

On the basis of this variation some have questioned the value of 
group tests as a means of classification of pupils. This variation gives 
no grounds for the skepticism in regard to their value as instruments for 
classification. The variation does set a practical problem for the 
users of group tests, the problem of equating the IQ’s on the 
various tests. 

This article is concerned with the problem of equating IQ’s obtained 
from 9 group tests, and also with the problem of test validation. 

On June 10, 1922, fifty-seven freshmen entering the University of 
Minnesota High School were given the Miller Mental Ability Test Form 
A, and the Haggerty Intelligence Examination Delta 2, in the order 
named. April 9, 1923, the same students were given Army Alpha Form 
8, Illinois General Intelligence Scale Form 1, and Terman Group Test 
of Mental Ability Form A in the order named. April 10, 1923, they 
were given the Dearborn Group Test of Intelligence Series II-C. 
May 12, 1923, they were given the Otis Self-administering Higher 
Examination Form A (20-minute time limit), and the Pressey Senior 
Classification Test in the order named. Between March 1 and May 
15 they were given the Stanford-Binet Individual Test by the mem- 
bers of a class of 38 seniors and graduate students in the third quarter 
of a course in Mental Tests and Diagnosis. In fairness to the Stanford-Binet
Test it should be stated that these university students were
not skilled in administering it. This should be kept in mind in interpreting
the results.

1 Gates, Arthur I.: “The Unreliability of the MA and the IQ.” Journal of
Applied Psychology, March, 1923.

The group consisted of 30 boys and 27 girls. The range of chronological
age on June 10, 1922, was 11-6 to 15-3. The mean CA was 13-6,
the median 13-6.7 and the SD was 9.35 months. The group was
superior to the average high school freshman class as is clearly evident
from Fig. 1 and Table III.

The sample ratings of one of the freshmen illustrate the problem of 
interpretation of scores made on group tests. 


TABLE I. SAMPLE RATINGS OF A FRESHMAN IN TERMS OF (1) RAW SCORE, (2)
RANK, (3) PERCENTILE RANK OF RAW SCORE, (4) MENTAL AGE, (5) INTELLIGENCE
QUOTIENT AND (6) INTELLIGENCE QUOTIENT TRANSLATED TO
TENTHS OF SD WITH ZERO AT NEGATIVE 5 SD

                                                  |   1   |   2  |     3      |    4   |  5  |   6
Tests                                             |  Raw  | Rank | Percentile | Mental | IQ  |  SD
                                                  | score |      |    rank    |   age  |     | score
Miller Mental Ability Test, Form A .............. |   73  | 21.5 |     62     | 18-10  | 141 |  54
Haggerty Intelligence Examination, Delta 2 ...... |  145  | 15   |     74     | 20-0   | 150 |  62
Army Alpha, Form 8 .............................. |  121  | 20.5 |     64     | 17-1   | 128 |  53
Illinois General Intelligence Scale, Form 1 ..... |  121  | 26.5 |     54     | 17-7   | 132 |  51
Terman Group Test of Intelligence, Form A ....... |  145  | 15.5 |     73     | 16-5   | 123 |  55
Dearborn Group Test of Intelligence, Series II-C  |   57  | 22   |     61     | 16-5   | 123 |  54
Miller Mental Ability Test, Form B .............. |   82  | 12   |     79     | 20-4   | 152 |  57
Otis Self-administering Higher Examination, A ... |   47  | 29.5 |     48     | 15-9   | 118 |  51
Pressey Senior Classification Test .............. |   53  | 33   |     42     | 15-10¹ | 119 |  50

1 Raw scores and mental ages have been readjusted to allow for varying dates
on which tests were given.


It is obvious that the relative size of raw scores, Column 1, means 
nothing since the possible score is not the same in any two of the tests. 
The simplest method of making the raw scores comparable is to rank 
them from the best to the poorest, calling the highest one in the group
Rank 1. Knowing the number of cases in the group the person’s rank 
has meaning which makes it possible to compare his standing in the 
nine tests. The sample ratings in Table I show that this freshman 
varied in raw score from Rank 12 to Rank 33. To get at his composite 
rank in the nine tests one would have to take the sum of the ranks of
each of the 57 freshmen in the nine tests and make of these sums a new
rank order. His rank in this new series would not necessarily be the
mean of his ranks in the nine tests. This method is convenient for use
in small groups but it does not make allowance for differences of
chronological ages of the students tested. It also makes it necessary
to interpret each rank in terms of the number ranked.

[Figure: IQs and SD Scores of 57 UHS Freshmen, 10 Tests (1922-23)]

Fig. 1. Intelligence quotients.
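
The rank-sum procedure just described can be sketched in a few lines. This is a minimal illustration with made-up ranks for three hypothetical pupils on three tests; the function name is an assumption, and ties in the rank sums are given the mean of the tied positions, in keeping with the fractional ranks shown in Table I.

```python
def composite_ranks(ranks_by_pupil):
    """Rank pupils by the sum of their ranks on several tests.

    ranks_by_pupil: dict mapping pupil -> list of ranks (1 = best).
    Returns a dict mapping pupil -> composite rank; tied sums share
    the mean of the positions they occupy.
    """
    sums = {pupil: sum(ranks) for pupil, ranks in ranks_by_pupil.items()}
    ordered = sorted(sums, key=sums.get)  # smallest rank sum first
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # extend j over all pupils tied with the pupil at position i
        while j + 1 < len(ordered) and sums[ordered[j + 1]] == sums[ordered[i]]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2
        for pupil in ordered[i:j + 1]:
            result[pupil] = shared_rank
        i = j + 1
    return result

# Hypothetical ranks, not data from the study.
toy_ranks = {"A": [1, 3, 3], "B": [2, 1, 1], "C": [3, 2, 2]}
composite = composite_ranks(toy_ranks)
print(composite)
```

In this toy case pupil A has a mean rank of about 2.3 but a composite rank of 2.5, which illustrates the remark that the new rank need not equal the mean of the separate ranks.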

Ranking in terms of a hypothetical 100 students, percentile rank- 
ing, makes possible an interpretation of the rank without reference to 
the particular number rated. The percentile rank means the percentage
of the group making scores at and lower than that rank.¹ Percentile
rank may be applicable either to the group under consideration
or to some larger group upon which the test was standardized. The 
percentile ranks in column 3, Table I, are made from the 57 cases and 
not from ninth year pupils in general. In terms of ninth year norms 
for the Miller Mental Ability Test this freshman would have a percentile 
rank of 87. 
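
The conversion from rank to percentile rank within the group of 57 can be checked numerically. The formula below is inferred rather than stated in the article: with Rank 1 as the best score, 100 × (N − rank) / N appears to reproduce column 3 of Table I when rounded to the nearest whole number. The function name is hypothetical.

```python
def percentile_rank(rank, n):
    """Per cent of a group of n standing below a given rank (1 = best).

    Assumed formula; it is inferred from Table I, not quoted from the article.
    """
    return 100 * (n - rank) / n

# (rank, tabled percentile rank) pairs from Table I, columns 2 and 3.
table_i = [(21.5, 62), (15, 74), (20.5, 64), (26.5, 54), (15.5, 73),
           (22, 61), (12, 79), (29.5, 48), (33, 42)]

computed = [round(percentile_rank(rank, 57)) for rank, _ in table_i]
tabled = [pct for _, pct in table_i]
print(computed == tabled)
```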

In order to arrive at an absolute measure that would be directly 
comparable in several tests an attempt has been made to establish 
mental age equivalents for group test scores. This study and others 
show clearly that authors of group tests have not succeeded in estab- 
lishing age norms that are directly comparable. Calculating mental 
ages as of June 10, 1922, the following results were obtained for the 
several tests on this particular group: 


TABLE II. CENTRAL TENDENCIES AND VARIABILITY OF MENTAL AGES OF 57
HIGH-SCHOOL FRESHMEN ON 10 TESTS

                                             |          Mental age
                                             |  Mean   | Median  |    SD
 1. [test name illegible] .................. | 17-7    | 17-4    | 2-7
 2. [test name illegible] .................. | 17-2    | 17-2    | 2-3
 3. Stanford-Binet ......................... | 15-8    | 15-11   | 1-5.5
 4. [test name illegible] .................. | 16-7    | 16-9    | 1-7
 5. [test name illegible] .................. | 17-6    | 17-8    | 1-7.5
 6. [test name illegible] .................. | 15-9    | 15-11   | 1-2
 7. [test name illegible] .................. | 16-0    | 16-4    | 1-3
 8. [test name illegible] .................. | 15-10   | 15-8    | 1-3
 9. Miller B ............................... | 18-9²   | 18-10   | 1-11
10. Pressey Senior Classification .......... | 16-1¹   | 16-2.5  | 1-10
Chronological age June 10, 1922 ............ | 13-6    | 13-6.7  | 9.35 months

1 Mental ages were arbitrarily extended beyond 17½, the highest mental age
equivalent reported in Pressey’s tentative norms.

2 The only explanation the author has for the large difference in mean mental
age between Form A and Form B of the Miller Mental Ability Test is that after
the norms had been established the items in Form B were rearranged in order of
difficulty.


1 Manual of Directions, Miller Mental Ability Test, p. 12, World Book Company,
Yonkers, New York, 1921.
Assuming that the mean chronological age of high-school freshmen
is 14 years 9 months in September, and that the mean IQ is 105, the 
mean mental age would be 15 years 6 months. 
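
The arithmetic behind this norm follows from the definition IQ = 100 × MA / CA. A minimal sketch, with hypothetical helper names and ages expressed in months:

```python
def to_months(years, months=0):
    """Convert an age given as years and months to months."""
    return 12 * years + months

def mental_age_months(ca_months, iq):
    """Mental age implied by a chronological age and an IQ (IQ = 100 * MA / CA)."""
    return ca_months * iq / 100

norm_ma = mental_age_months(to_months(14, 9), 105)
print(norm_ma)  # 185.85 months, i.e. about 15 years 6 months
```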

The following table gives the percentage of this group of 57 high- 
school freshmen equalling and exceeding this ninth grade norm. 


TABLE III. PER CENT OF 57 HIGH-SCHOOL FRESHMEN HAVING AND EXCEEDING
MENTAL AGE NORM OF 15 YEARS AND 6 MONTHS

[test name illegible] ...................................... 75.4
[test name illegible] ...................................... 70.2
[test name illegible] ...................................... 61.4
[test name illegible] ...................................... 77.2
[test name illegible] ...................................... 91.2
[test name illegible] ...................................... 64.9
[test name illegible] ...................................... 73.7
[test name illegible] ...................................... 94.7
[test name illegible] ...................................... 59.6
Pressey Senior Classification .............................. 68.4


The mean mental age (Table II), varies from 15 years 8 months on 
the Stanford-Binet to 18 years 9 months on Miller B, a difference of 
3 years 1 month. The large differences between mean mental ages are 
due in most cases to the difference in the upper age limits. Some of 
the tests allow no mental ages above 19-6, others allow mental ages as 
high as 26 years 8 months. 

It is obvious that a test which allows no mental ages above 19-6 
cannot rate properly in terms of IQ a student who makes a mental 
age of 19-6 or above, as two of these freshmen did. One of them 
exhausted the SB scale at 14 years 1 month, CA, giving him an IQ of 
138. One year later with the same performance his IQ would be 122. 
Assuming that at 10 years of age his IQ was 150, it is very apparent 
that a test to yield the same IQ at 16 must provide a mental age of 
24 (1.50 × 16). Obviously tests that do not provide such an extension
of mental age will yield too low IQ’s for superior adults. There is
a further complication of the IQ rating in the difference of opinion in 
regard to the age at which mental maturity is reached. 
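
The effect of a mental age ceiling on the IQ can be traced with a few lines of arithmetic. The sketch below is illustrative: the function name and the optional adult-age cap are assumptions. The figure of 138 is reproduced from a ceiling of 19-6 at a CA of 14-1; the figure of 122 is reproduced when the divisor is taken as 16 years, the adult age mentioned in this article, and that reading of the text is an assumption.

```python
def iq(ma_months, ca_months, adult_ca_months=None):
    """IQ = 100 * MA / CA, optionally holding CA at an assumed adult age."""
    if adult_ca_months is not None:
        ca_months = min(ca_months, adult_ca_months)
    return 100 * ma_months / ca_months

ceiling_ma = 19 * 12 + 6                    # highest mental age allowed: 19-6
iq_at_14_1 = iq(ceiling_ma, 14 * 12 + 1)    # scale exhausted at CA 14-1
iq_at_adult_divisor = iq(ceiling_ma, 17 * 12, adult_ca_months=16 * 12)
required_ma_years = 1.50 * 16               # MA needed for IQ 150 at age 16

print(round(iq_at_14_1), round(iq_at_adult_divisor), required_ma_years)
```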

Probably the best solution of all these difficulties is to do away with 
the IQ concept in dealing with superior high school students. This 
certainly would be desirable after adult age 16 is reached, since the 
denominator in the mental age-chronological age ratio would be 
constant. There is a decided advantage in preserving the IQ concept 
in expressing the rating of children under 16 years of age, since the 
difference in the degree of brightness of a 12 year old and a 15 year old
making the same score is marked.

If we are to continue the use of the IQ on the high-school level some 
method is necessary to equate the IQ’s obtained on the different tests. 
In the sample rating, Table I, column 5, it will be noted that this 
freshman’s IQ varies from 118 to 152. In no test does he receive an 
IQ that would place him in the lower half of the high-school freshmen. 

Early in the history of testing Woodworth¹ proposed that the combining
of scores on different tests might be made possible by translating
the scores into standard deviation or average deviation units. If 
the central tendencies and standard deviations of the distribution of 
IQ’s obtained from group tests were identical with those obtained from
the Stanford-Binet Test on the same group of students a translation 
into terms of variability would be useless since IQ’s could then be 
compared directly. Results of the tests show that neither central 
tendency nor variability measures are the same on any one group of 
students, hence the IQ’s are not directly comparable. 

The author proposes, therefore, the translation of IQ’s into tenths 
of SD of the distribution taking negative 5 SD as zero. The trans- 
lation may be greatly facilitated by the graphic method shown in Fig. 
1 in which IQ’s are plotted on the abscissae and tenths of SD on the
ordinates. Given the IQ on any one of the tests it is possible to read 
directly the corresponding SD score. For example, an IQ of 145 
on Alpha is 1.5 SD of the distribution of this group above the mean 
IQ of this group on Alpha. On Dearborn II-C or SB an IQ of 135 
would be the same distance (1.5 SD) above their mean IQ’s. 
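
The translation into tenths of SD with zero at negative 5 SD amounts to a linear rescaling in which the group mean becomes 50 and each SD counts as 10 points. A minimal sketch follows; the function name is hypothetical, and the mean and SD used in the example are invented for illustration, since the actual values for each test are read from Fig. 1.

```python
def sd_score(iq, mean_iq, sd_iq):
    """IQ expressed in tenths of SD, with zero placed 5 SD below the mean."""
    return 50 + 10 * (iq - mean_iq) / sd_iq

# Hypothetical distribution: mean IQ 124, SD 14 (not values from the study).
print(sd_score(145, 124, 14))  # 1.5 SD above the mean -> 65.0
print(sd_score(124, 124, 14))  # the mean itself -> 50.0
```

Two IQ’s from different tests are then comparable when their SD scores agree, whatever the raw IQ values may be.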

Referring to Fig. 1, one can see how absurd it would be to think of 
an IQ of 140 as “near genius or genius”² if the IQ had been obtained
from Alpha, Pressey, Delta 2, Illinois, Miller A, or Miller B. On 
Miller B an IQ of 140 in this group is only 0.4 SD above the mean of 
this group, while an IQ of 140 on SB is 1.88 SD above the mean of the 
group. It is apparent that an IQ of 175 on Miller B which is 1.88 
SD above the mean IQ of the group might more likely be thought of as 
designating “near genius.”

An investigation of the high school marks which these students
received revealed that none of the letter ratings, A, B, C+, C, C-, D,
and F had the same meaning for any two teachers, so the marks were
translated into SD units after having been weighted as follows: A, 11;
B, 8; C+, 7; C, 6; C-, 5; D, 4, and F, 1.

1 Woodworth, R. S.: “Combining the Results of Several Tests: A Study in
Statistical Method.” Psychological Review, Vol. XIX, pp. 97-123, 1912.

2 Terman, Lewis M.: “The Measurement of Intelligence.” Houghton
Mifflin Co., New York, 1916, p. 79.
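
The weighting and translation of marks into SD units might be sketched as follows. The weights are those given in the text; the rest is an assumption made for illustration: the marks of each teacher are standardized separately, using the population standard deviation, and the function name and sample marks are hypothetical.

```python
from statistics import mean, pstdev

# Weights as given in the article.
WEIGHTS = {"A": 11, "B": 8, "C+": 7, "C": 6, "C-": 5, "D": 4, "F": 1}

def marks_to_sd_units(marks):
    """Convert one teacher's letter marks to deviations in SD units.

    Assumes the teacher gave at least two different marks, so that the
    standard deviation is not zero.
    """
    values = [WEIGHTS[m] for m in marks]
    centre, spread = mean(values), pstdev(values)
    return [(v - centre) / spread for v in values]

# Hypothetical marks from a single teacher.
teacher_marks = ["A", "B", "C", "C", "D"]
sd_units = marks_to_sd_units(teacher_marks)
print([round(z, 2) for z in sd_units])
```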

The correlation coefficients in Table IV were calculated from the SD
ratings by the Pearson Product-Moment formula. These correlation
coefficients average about .01 lower than those calculated from the
IQ’s themselves. The correlations are more easily calculated from the
SD scores since the numbers are smaller and the mean for each test is 50.

The intercorrelations indicate that the 10 tests are measuring very
much the same thing with rather marked consistency. In column 12
are the correlations of each test with the mean of all other tests; all
of these correlations are positive and high, above 0.82. Half of them
are above 0.90.

In column 16 are correlations with a criterion external to the tests,
the mean of all school marks for the freshman year. All of these
correlations are above 0.49; eight of the ten are above 0.60.

The intercorrelations of the three high-school subjects are in general
much lower than the intercorrelations of the tests. The correlations
between English and general science (0.34), English and mathematics
(0.57), general science and mathematics (0.59), are all lower than are the
correlations between some of these 30-minute tests and marks for
one year in the same subjects, for example, Miller A and English, 0.66;
Illinois and English, 0.60; Miller B and English, 0.63; Otis and English, 0.67;
Pressey (16 minutes) and general science, 0.62.

It is safe to conclude from the study of these 57 high-school fresh- 
men on these 10 tests that 

1. The mental age norms vary so much that it is impossible to 
interpret the IQ’s from all group tests according to the S-B standard. 
IQ’s obtained from Terman A, Dearborn II-C, and Otis S-A,A, give 
distributions on this group most like the SB distribution; 

2. That direct comparison of IQ’s on several tests is greatly facilitated
by translating the IQ’s into variability units, in this case into
tenths of SD with zero at 5 SD negative;

3. That on the four tests named in (1) superior high-school students
have IQ’s uniformly too low;

4. That for classification purposes, the 10 group tests are valid
instruments, judging from the correlations with the average of freshman
high-school marks.











STATISTICAL ISSUES 


RAYMOND FRANZEN 


University of California 


Disagreement among specialists is not conducive to faith in prac- 
titioners. Recent statistical discussions have exposed a lack of 
agreement among the statisticians that will inevitably widen the breach 
between “measurement” and “democracy.” This is more serious
because school practitioners and critics of measurement do not read 
statistical discussions and do note the conclusions of statisticians. 
When there is disparity of belief, they mistrust all expert opinion. 

Before entering into a detailed discussion I will outline the issues 
involved. My first criticism is primarily of myself,¹ in the second place
of Toops and Symonds,² and in general of the use of age as a common
denomination of test scores. The use of age to express the scores of 
various tests assumes a like correlation with chronological age for all 
tests used. Toops and Symonds in an otherwise admirable treatment
of risks in usage of the AR overlook this fundamental weakness: 

“If the AQ procedure is to have a monopoly on Stanford IQ’s, it
necessarily must have a monopoly on Stanford MA’s, for it will be
seen that the CA’s cancel out in Equation (1), which leaves two simple
variables, EA and MA.” [Equation (1) is $AR = \frac{EA/CA}{MA/CA}$.]

But EA and MA are not “simple variables” since they are positions
on the regressions of score-age correlation scatters. The EA’s of a
test are a function of the correlation of that test with chronological age.

1 Franzen, Raymond: The Accomplishment Ratio, Teachers College Contributions
to Education, No. 125.
2 Journal of Educational Psychology, Vol. XIII, No. 9, and Vol. XIV, No. 1.
3 Journal of Educational Psychology, Vol. XIV, No. 2, pp. 103-108.

My second criticism is of Chapman’s conclusions in regard to “The
Unreliability of the Difference between Intelligence and Educational
Ratings.”³ Chapman shows that if the correlation between an intelligence
test and a product test is 0.70, then the reliabilities of the intelligence
test and of the product test must each be 0.93 in order to have a
correlation of 0.75 between repeated indices. (An index is here used to mean
an intelligence rating minus an educational rating, each being deviations
expressed in multiples of their sigma.) It follows equally from Chapman’s
formula, however, that if the reliabilities are both 0.9 and the
correlation between intelligence and product is also 0.9, the correlation
of the indices is zero. It also follows, and this is more important,
that if the reliabilities are 0.9 and the correlation between intelligence
and product is only 0.1, the correlation of indices would be 0.89 and that
if the reliabilities are 0.7 and the correlation between intelligence and
product is 0.1, the correlation of indices would be 0.86. These examples
are not given by Chapman. They show that the criticism of unreliability
of indices is not general. Intelligence-minus-product computations
are unreliable when they are small, and reliable when they are
large. When they are small they are unimportant, since it is large
ones that influence practice. Chapman says:¹

“Such facts as are presented above must be recognized by those 
who propose to determine the difference, within a single grade, of 
intellectual and school achievements when measured by such instru- 
ments as are at present available.” 

And these facts involve this assumption: 

“We will assume, as is reasonable, that the true correlation of the
ideal intelligence test and the ideal school test is 0.7.” 

This is not at all reasonable since such correlation is a function 
of the efficiency of the instruction and varies from 0.5 to 0.9. When it 
is 0.5, the indices are used and then they are reliable; when it is 0.9, 
they are not used—and then they are unreliable. This does not mean 
that the fact that they are small is unreliable, but that distinctions in 
size are unreliable when they are all small. 
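
Chapman’s formula is not reproduced in this article. The sketch below assumes it is the standard expression for the correlation between repeated differences of two standardized measures, $r_{dd} = \frac{(r_{xx} + r_{yy})/2 - r_{xy}}{1 - r_{xy}}$; the function name is hypothetical. Under that assumption the figures of zero and 0.89 quoted above are reproduced, the case of reliabilities 0.93 with an intercorrelation of 0.70 gives roughly 0.77 rather than 0.75, and reliabilities of 0.7 with an intercorrelation of 0.1 give about 0.67 rather than the 0.86 printed above, so the last figure may rest on a different form of the formula or may be a misprint.

```python
def index_reliability(r_xx, r_yy, r_xy):
    """Correlation between repeated difference scores (assumed standard formula).

    r_xx, r_yy: reliabilities of the two measures.
    r_xy: correlation between the two measures.
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

cases = {
    "rel 0.93, r 0.70": index_reliability(0.93, 0.93, 0.70),
    "rel 0.90, r 0.90": index_reliability(0.90, 0.90, 0.90),
    "rel 0.90, r 0.10": index_reliability(0.90, 0.90, 0.10),
    "rel 0.70, r 0.10": index_reliability(0.70, 0.70, 0.10),
}
for label, value in cases.items():
    print(label, round(value, 2))
```

The pattern supports the argument of the text: for fixed reliabilities, the index is dependable when the intercorrelation is low and undependable when it is high.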


1 Journal of Educational Psychology, Vol. XIV, p. 108.
2 Journal of Educational Psychology, Vol. XIII.

Otis² gives evidence to prove that $x = \frac{\sigma_x}{\sigma_y} y$ is a better prediction
than the more usual $x = r \frac{\sigma_x}{\sigma_y} y$. This is comparable to saying that
$\frac{x}{\sigma_x} = \frac{y}{\sigma_y}$ is better usage than $\frac{x}{\sigma_x} = r \frac{y}{\sigma_y}$. Now what Otis’s assumptions
in this proof mean is that when you are interested only in the
common element of x and y and consider everything else chance, then
$\frac{x}{\sigma_x} = \frac{y}{\sigma_y}$ except for chance, since then $r_{xy} = 1.00$ except for chance.
It seems to me to be necessary to prove that $r_{xy}$ can be near 1.00 before
it is allowable to use this prediction. When $r_{xy}$ is near zero, $\frac{x}{\sigma_x}$ obviously
does not equal $\frac{y}{\sigma_y}$. But if (when x = intelligence and y = product
and 1 = testing before special training in y and 2 = testing after special
training in y) $r_{x_1 y_1} < 0.6$ and $r_{x_2 y_2} > 0.8$ and both $r_{x_1 x_2}$ and $r_{y_1 y_2} > 0.8$,
then $\frac{x}{\sigma_x} = \frac{y}{\sigma_y}$ is allowable. This is the case with some tests. We
use these indices in Contra Costa County, California, and have them
tabled for every age. We could not, however, assume that $\frac{x}{\sigma_x} = \frac{y}{\sigma_y}$
since the r before special training in y is low. The σ of such
predictions as these is the better way to express reliability of indices
and will be considered in the section devoted to criticism of Chapman’s
article.
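
The contrast between the two predictions can be put in terms of expected squared error for standard scores. This framing is an assumption introduced here for illustration and is neither Otis’s proof nor the author’s own computation: for standard scores with correlation r, predicting $\frac{x}{\sigma_x}$ by $\frac{y}{\sigma_y}$ has mean squared error $2(1 - r)$, while the regression prediction $r \frac{y}{\sigma_y}$ has mean squared error $1 - r^2$. The function names are hypothetical.

```python
def mse_equal_standard_scores(r):
    """Expected squared error when x/sigma_x is predicted by y/sigma_y."""
    return 2 * (1 - r)

def mse_regression(r):
    """Expected squared error when x/sigma_x is predicted by r * y/sigma_y."""
    return 1 - r**2

for r in (0.0, 0.6, 0.8, 0.9, 1.0):
    print(r, round(mse_equal_standard_scores(r), 2), round(mse_regression(r), 2))
```

The two errors differ by $(1 - r)^2$, so they nearly coincide only when r is close to 1.00, which agrees with the condition laid down in the text for using the simpler prediction.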

The third portion of this paper is an examination of the treatment
of the reliabilities of certain reading tests. Monroe uses
$\frac{\text{P.E. of measurement}}{\text{average}}$ as his measure of the reliability of tests. This is
$\frac{\sigma}{\text{average}}\sqrt{1 - r}$ and $\frac{\sigma}{\text{average}}$ is a
proportion. Proportions are in part functions of the location of zero
which is different for different tests. I take issue with the following
conclusions:

“We are, however, interested in securing a measure of the departure
from perfect correlation. Hence the probable error of measurement is
a much better index of the degree of reliability than either $r_{12}$ or $r_{1\infty}$”
(p. 34).

“As we have indicated, the ratio of the probable error of measurement
(PE$\sqrt{1 - r_{12}}$) to the average probably furnished the most significant
statement of the degree of unreliability” (p. 36).

“The coefficient of reliability is shown not to be a satisfactory
measure of reliability” (p. 52).








1 Since r gives you the relation between $\frac{x}{\sigma_x}$ and $\frac{y}{\sigma_y}$, it gives you the average $\frac{x}{\sigma_x}$
for a given $\frac{y}{\sigma_y}$ and this prediction varies with the size of r. Predictions are affected
by r in two ways: (1) the size of increase of $\frac{x}{\sigma_x}$ for given increments of $\frac{y}{\sigma_y}$, and
(2) the spread of $\frac{x}{\sigma_x}$ for a given $\frac{y}{\sigma_y}$, which is $\sqrt{1 - r^2}$.

2 “Report of Progress of Measurement in Contra Costa County, Cal.”
3 Bulletin No. 8 of the Illinois Bureau of Research.
I. COMMENSURATION AND AGE¹

Translation of scores into age has seemed such a simple solution 
to our need of commensuration that a wholesale acceptance has blinded 
us to its retail inadequacies. I confess to my own share in the concoc- 
tion of quotients and even admit a practical advantage in their use. 
But I think the time has come to emphasize the crucial shortcoming of 
a common unit age and supplant it by a better common denominator. 

Assuming equal reliability, when we know that the correlation
of each of a number of tests with chronological age is the same, we may
use the average scores of ages as their common denominator. Comparisons
must be limited, however, to the correlations attainable between the
tests. Thus, to say a child with a mental age of 10 may obtain a
reading age of 10 we must know (1) that both tests correlate alike
with age and (2) that the correlation between the tests is very high when
reading is at its maximum. These two considerations are a practical
necessity. We shall consider the first.

Since tests correlate differently with age, they measure precocity to
varying degrees. Whereas it does no harm that one test shall express
its scores as “mental age,” for many tests to do so results in confusion
unless increases in age yield the same increments of abilities in each
measure. If tests are differently affected by age, then average scores
of successive ages mean different increments of abilities on different
tests. There is the additional difficulty that in many tests it is by
no means clear that we want to measure precocity, as on these tests
differences within an age group are unlike in quality to differences
between age groups.²

Table I gives the correlations of each of 14 tests with chronological
age, the data being 57 ninth grade children.³ The σ of ages is 10.73
months. The average age is 168.82 months.

The distribution of scores in each test is an array or two of a score- 
age scatter. The higher the negative correlation between age and 
score in this portion of the table limited by chronological age, the lower 





¹ The argument here is directed against the assumption that a translation of
scores into “mental ages” and “educational ages” makes them comparable and
is, of course, entirely unrelated to the use of “mental ages” in the expression of
Binet scores.

² The importance of challenging the value of age as a common denominator of
test scores may be realized through its uncritical acceptance in a recent book,
“Mental Tests and the Classroom Teacher,” Virgil E. Dickson, 1923.

³ Franzen, Raymond: Attempts at Test Validation. Journal of Educational
Research, September, 1922.
may we expect the positive correlation between age and score in the
table as a whole. The reliabilities of these tests are substantially 
the same except for a few of the shorter ones. Thus, if all ages are 


TABLE I
Test                                                      r with CA
[test name illegible] ................................... −.49
[test name illegible] ................................... −.47
[test name illegible] ................................... −.45
[test name illegible] ................................... −.42
[test name illegible] ................................... −.39
[test name illegible] ................................... −.38
[test name illegible] ................................... −.33
[test name illegible] ................................... −.32
[test name illegible] ................................... −.30
[test name illegible] ................................... −.29
[test name illegible] ................................... −.27
[test name illegible] ................................... −.24
[test name illegible] ................................... −.14
[test name illegible] ................................... −.12


used, the correlation between age and score will probably be lowest 
for the Wylie, second lowest for the Illinois and highest for the Dear- 
born-2, etc. This is substantially true of correlations between age 
and score on these tests in a group unselected for age. In any case 
their correlations with age do differ apart from differences in reliability. 
It means that the sigma of scores with age constant is much larger in 
relation to the sigma of scores unselected for ages in the Wylie than it is 
in the Dearborn-2. There is much more overlapping of scores in 
distributions for successive ages in the case of the Wylie. If we have a
distribution of scores not classified by ages, then our sigma equals
$\left(\frac{\Sigma x^2}{n}\right)^{1/2}$. Now (when x = scores and y = age) if we take our
deviations of score from the regression instead of from the average,
each deviation is $x - r\frac{\sigma_x}{\sigma_y}y$ and the sigma around this regression line
equals

$$\left(\frac{\Sigma\left(x - r\frac{\sigma_x}{\sigma_y}y\right)^2}{n}\right)^{1/2} = \sigma_x\sqrt{1 - r^2_{xy}}.$$

When $r_{xy}$ is 1.00, this σ is zero. When $r_{xy}$ is zero, then this σ is equal
to the total σ. This is the partial sigma since it is the sigma when each
deviation is taken from the average score for the age group to which the
individual belongs instead of from the average score of the total group.
The sigma of Wylie scores, irrespective of the effect of age, is almost as
great as the sigma where age is a factor which causes variability, whereas
the sigma of Dearborn scores is greatly reduced when age is rendered
constant.

From this it follows that in different mental tests “mental age”
means different things (even if they be assumed to measure the same
human qualities).

Let us assume that the correlation between Test A and age is 0.80
and the correlation between Test B and age is 0.44 in a group unselected
for age and comprised of many children, ages 8 to 14 inclusive. Let us
further assume that the average scores for various ages are as follows:


AGE        MEAN ON TEST A        MEAN ON TEST B
 8               12                    12
10               22                    22
12               32                    32
14               42                    42


Let us assume the sigma of age to be 1.00. With these assumptions
we may calculate what the sigmas of Test A and Test B are. Let
x = age, y = Test A and z = Test B. Since the mean score rises 5 points
for each year of age, the regression slope is 5. Then

$r_{xy}\frac{\sigma_y}{\sigma_x} = 5$
$0.8\,\sigma_y = 5$
$\sigma_y = 6.25$
Similarly $\sigma_z = 11.36$

The sigma of scores on Test A within an age group is likely to be:

$\sigma_y(1 - r^2_{xy})^{1/2} = 6.25(1 - .64)^{1/2} = 3.75$

(This is the standard deviation when each individual deviation is
taken from the average score of the age to which the individual belongs
instead of from the average of the whole group.) The sigma of scores
on Test B within an age group is likely to be:

$\sigma_z(1 - r^2_{xz})^{1/2} = 11.36(1 - .19)^{1/2} = 10.22$

We have now assumed everything to be the same in the two tests
except their respective correlation with age and as a consequence
their sigmas for a given spread of age (sigma of age was taken to be
1.00) and their sigmas with age constant. The partial sigmas are 3.75
in Test A and 10.22 in Test B. To get a mental age of 12 on either
test a child must have a score of 32. For an eight-year-old this means
a positive deviation of 20 units since the average of eight-year-olds is 12.
This means an $\frac{x}{\sigma}$ of $\frac{20}{3.75}$ in Test A and of $\frac{20}{10.22}$ in Test B. Then to
attain a mental age of twelve an eight-year-old must be 5.33σ above
the mean of his age group in Test A and only 1.96σ above the mean of
his age group in Test B. And so with all other examples, being above
the average of one’s own age group means less in terms of “mental
age” or “educational age,” the higher the test correlates with age.
This is no small difference. In the example given an eight-year-old
child who is 3σ above his age’s mean in Test A will have a mental age
of about 10, whereas a child who is 3σ above his age’s mean in Test B
will have a mental age of about 14. Still, if Tests A and B are
highly correlated, with age rendered constant, then the same child
will get these scores and will be pronounced mentally aged 10 and 14
respectively on two tests measuring the same quality.
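The arithmetic of this illustration can be checked with a short sketch. It is not part of the original article: the function names are hypothetical, the age means are assumed to be exactly linear (5 points per year, as in the table of means), and the figures are recomputed from the stated assumptions rather than copied from the text. Computed this way, the within-age sigma of Test A is 6.25 × 0.6 = 3.75, and that of Test B is about 10.20 when 1 − r² is not rounded to .81.

```python
import math

def total_sigma(slope, r, sigma_age=1.0):
    """Sigma of scores implied by the regression slope r * sigma_score / sigma_age."""
    return slope * sigma_age / r

def within_age_sigma(sigma, r):
    """Sigma of scores with age held constant: sigma * sqrt(1 - r^2)."""
    return sigma * math.sqrt(1.0 - r ** 2)

def mental_age(score, base_age=8.0, base_mean=12.0, slope=5.0):
    """Age whose mean score equals `score`, assuming linear age means."""
    return base_age + (score - base_mean) / slope

SLOPE = 5.0  # the mean rises 10 points for every 2 years of age

for name, r in (("A", 0.80), ("B", 0.44)):
    s_total = total_sigma(SLOPE, r)
    s_within = within_age_sigma(s_total, r)
    # Standard score an eight-year-old needs to reach a score of 32 (mental age 12).
    z_needed = (32.0 - 12.0) / s_within
    # Mental age of an eight-year-old who is 3 sigma above his age mean.
    age_at_3_sigma = mental_age(12.0 + 3.0 * s_within)
    print(f"Test {name}: sigma={s_total:.2f}, within-age sigma={s_within:.2f}, "
          f"z for mental age 12={z_needed:.2f}, "
          f"mental age at +3 sigma={age_at_3_sigma:.2f}")
```

The same standing within an age group (3σ) lands near mental age 10 on the test that correlates 0.80 with age and near 14 on the test that correlates 0.44, which is the point of the illustration.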

The issue is whether or not we are to accept precocity as our only
objective of measurement. Concretely put, is the child who shows
equal ability in Tests A and B, judged by his $\frac{x}{\sigma}$ in his own age group,
but with a mental age of 10 and 14 respectively on these tests, really
of equal ability in Tests A and B, or is he better in B? Obviously we
would not be willing to say that a child who was three, four or five 
sigma above the average of his age in an intelligence test was not 
indicated as extremely bright, even though this did not mean a high 
age rating as would be the case if the test correlated very high with age. 
Also, if a child of eight years were only one-half of a sigma above the 
mean score of eight-year-olds, we would be equally loath to call him 
extremely bright even though his mental age were 14 because this 
mental-age rating only reflects the low correlation of the test with age. 

All kinds of tests are desirable—those highly correlated with age 
down to the very desirable intelligence test which will have a zero 
correlation with age. (This one would mean that a “mental age”’ 
for every score over average is indeterminate if we remember that 
“mental age” is defined as that age of which the obtained score is the 
average, and that a zero correlation means an entire overlapping of 
the scores of successive ages.) We want very much to have a test 
which will measure the brightness of a child directly, independent of 
what age he is. We cannot change tests so that they all have the same 
correlation with age, which is the thing we would need to do in order to 
be able to use age as a common denominator. In the first place it is 
practically impossible to change them, and in the second place we 
would lose too many other values if we did. We want to know how
age affects different qualities and so we want to know the difference 
in correlations of tests with age. 

Undoubtedly it is important to know the “reading age” of a
child and his “arithmetic age” and his “mental age.” But the mis-
take we make is in assuming that the same precocity in different tests 
means the same degree of excellence. We should know not only 
where a child stands in relation to the various age means, which 
locates him on the growth curve, but in relation to the spread around 
each of these means, which takes into account the overlapping of that 
particular test and often changes our attitude and judgment. A child 
may have a mental age of 14 and a reading age of 12 and an arithmetic 
age of 13 and still he may have done exactly as well on the arithmetic 
test and on the reading test as he has done on the intelligence test pro- 
vided that the three tests have different correlations with age. 

Use of deviations expressed in multiples of the σ of a large unselected
group in Contra Costa County has confirmed this view of the superior
value of that common denomination to that of age.¹


II. THE RELIABILITY OF THE DIFFERENCES BETWEEN INTELLIGENCE
AND EDUCATIONAL RATINGS


Caution in our use of tests and especially in our commensuration of
tests is a topic appropriate to our present needs. The recent article
by J. Crosby Chapman,² however, uses formulas which indict a little too
severely a procedure of comparison of Sigma Indices. He shows that

$$r_{d_1d_2} = r_{\left(\frac{x_1}{\sigma_{x_1}} - \frac{y_1}{\sigma_{y_1}}\right)\left(\frac{x_2}{\sigma_{x_2}} - \frac{y_2}{\sigma_{y_2}}\right)} = \frac{r_{x_1x_2} + r_{y_1y_2} - r_{x_1y_2} - r_{x_2y_1}}{2(1 - r_{x_1y_1})^{1/2}(1 - r_{x_2y_2})^{1/2}}.$$

When $x_1$ and $x_2$ are intelligence tests and $y_1$ and $y_2$ are product tests,
this is the correlation between the differences (intelligence minus
product), when all scores are deviations expressed as multiples of σ. Now,
if all the correlations involved were 1.00, this r would be indeterminate
and this is easily understood since then there would be no variability
in either $\left(\frac{x_1}{\sigma_{x_1}} - \frac{y_1}{\sigma_{y_1}}\right)$ or in $\left(\frac{x_2}{\sigma_{x_2}} - \frac{y_2}{\sigma_{y_2}}\right)$, since if $r_{x_1y_1}$ and $r_{x_2y_2}$ both were
1.00, then $\frac{x_1}{\sigma_{x_1}} = \frac{y_1}{\sigma_{y_1}}$ and $\frac{x_2}{\sigma_{x_2}} = \frac{y_2}{\sigma_{y_2}}$.


¹ The use and results of this method are embodied in a report of the Program
of Measurement in Contra Costa County, California.
² This journal, February, 1923.

Assuming the reliability r’s within an age group for intelligence and
product ($r_{x_1x_2}$ and $r_{y_1y_2}$) both to be 0.8, since tests with this reliability
are available, then let us see what Chapman’s formula will yield first
when the correlations between intelligence and product ($r_{x_1y_1}$, $r_{x_1y_2}$,
$r_{x_2y_1}$ and $r_{x_2y_2}$) are all 0.8 and next when they are all 0.5 and last when
they are all 0.1.

When they are 0.8

$$r_{d_1d_2} = \frac{.8 + .8 - .8 - .8}{2(1 - .8)^{1/2}(1 - .8)^{1/2}} = \frac{0}{.4} = 0.$$

When they are 0.5

$$r_{d_1d_2} = \frac{.8 + .8 - .5 - .5}{2(1 - .5)^{1/2}(1 - .5)^{1/2}} = \frac{.6}{1.00} = .60.$$

When they are 0.1

$$r_{d_1d_2} = \frac{.8 + .8 - .1 - .1}{2(1 - .1)^{1/2}(1 - .1)^{1/2}} = \frac{1.4}{1.8} = .78.$$
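The three cases above can be reproduced with a few lines of code. This is a minimal sketch, not part of the original article; the function name `r_differences` and its argument names are hypothetical, and the formula is the correlation between two standard-score differences as quoted from Chapman.

```python
import math

def r_differences(r_x1x2, r_y1y2, r_x1y2, r_x2y1, r_x1y1, r_x2y2):
    """Correlation between (x1 - y1) and (x2 - y2), all in standard scores."""
    numerator = r_x1x2 + r_y1y2 - r_x1y2 - r_x2y1
    denominator = 2.0 * math.sqrt(1.0 - r_x1y1) * math.sqrt(1.0 - r_x2y2)
    return numerator / denominator

RELIABILITY = 0.8  # assumed equal for the intelligence and the product tests

for cross in (0.8, 0.5, 0.1):
    # All four intelligence-product correlations are set to the same value.
    value = r_differences(RELIABILITY, RELIABILITY, cross, cross, cross, cross)
    print(f"intelligence-product r = {cross}: r of differences = {value:.2f}")
```

With the reliabilities fixed, the result falls as the intelligence-product correlation rises, and it reaches zero when that correlation equals the reliability.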





Thus given the same reliabilities, that is, the r’s between successive
tests of the same kind (intelligence with intelligence and product with
product), $r_{d_1d_2}$ grows less the higher the correlation is between
intelligence and product. This is very reasonable since the lower the
correlations between the intelligence tests and the product tests, the greater
is the spread in $\left(\frac{x}{\sigma_x} - \frac{y}{\sigma_y}\right)$ and the lower these correlations when the
reliabilities are high, the more likely is a large $\left(\frac{x_1}{\sigma_{x_1}} - \frac{y_1}{\sigma_{y_1}}\right)$ associated
with a large $\left(\frac{x_2}{\sigma_{x_2}} - \frac{y_2}{\sigma_{y_2}}\right)$ and a small $\left(\frac{x_1}{\sigma_{x_1}} - \frac{y_1}{\sigma_{y_1}}\right)$ associated with a small
$\left(\frac{x_2}{\sigma_{x_2}} - \frac{y_2}{\sigma_{y_2}}\right)$, since these very differences make the correlation $r_{xy}$ low.
The higher the r’s are or the nearer they are to the reliability
coefficients, the less reliable are the $\left(\frac{x}{\sigma_x} - \frac{y}{\sigma_y}\right)$’s since the lack of such
differences makes the $r_{xy}$ high.
For this reason $r_{d_1d_2}$ is not a measure of what it is used to condemn.
No one desires to have individual variation in $\left(\frac{x_1}{\sigma_{x_1}} - \frac{y_1}{\sigma_{y_1}}\right)$ correspond
with individual variations in $\left(\frac{x_2}{\sigma_{x_2}} - \frac{y_2}{\sigma_{y_2}}\right)$ except when $r_{xy}$ is low. The
point here is that Chapman uses an assumption of high correlation
between product and intelligence to prove the “unreliability of
differences” when it is precisely in the case of low correlation that we need to
demonstrate reliability. What we want is reliable predictions of
$\frac{y}{\sigma_y}$ given an $\frac{x}{\sigma_x}$ of any amount. How likely is the $\frac{y}{\sigma_y}$ to be 1.6 if the
$\frac{x}{\sigma_x}$ is 1.8? What are the probable limits of $\frac{y}{\sigma_y}$ for an $\frac{x}{\sigma_x}$ of −0.97
in a correlation of 0.72? How sure are we that a given difference is
large? These questions we wish to answer and not questions such as
the following: Are those whose $\left(\frac{x_1}{\sigma_{x_1}} - \frac{y_1}{\sigma_{y_1}}\right)$ is relatively large also those
whose $\left(\frac{x_2}{\sigma_{x_2}} - \frac{y_2}{\sigma_{y_2}}\right)$ is relatively large? This merely asks whether
individuals are distinguished in one appraisal as they are in another,
but has no reference to the importance or size of the distinction. This latter
question is answered by $r_{d_1d_2}$ and means even less than it would seem to
as expressed above, since the more reliable become the predictions of
$\frac{y}{\sigma_y}$ from $\frac{x}{\sigma_x}$, the less reliable become the differences. That is because
when $r_{xy}$ is high, then all differences (intelligence minus product) are
small, and so their relative size is unreliable and also unimportant.

The measure we desire in order to estimate the reliability of our
predictions is the σ of our $\frac{y}{\sigma_y}$’s with $\frac{x}{\sigma_x}$ rendered constant. Just as
$\sigma_x(1 - r^2_{xy})^{1/2}$ is the σ of x with y constant, i.e., the σ around the
regression, so the σ of $\frac{y}{\sigma_y}$ with $\frac{x}{\sigma_x}$ constant is $\sqrt{1 - r^2_{xy}}$ (since σ of
$\frac{y}{\sigma_y}$ = 1.00). Then the σ of $\frac{y}{\sigma_y}$ around the prediction of $\frac{y}{\sigma_y}$ from $\frac{x}{\sigma_x}$
is $\sqrt{1 - r^2_{xy}}$, and

$$\frac{y}{\sigma_y} = r_{xy}\frac{x}{\sigma_x} \pm .6745\sqrt{1 - r^2_{xy}}.$$

This expresses the average and PE of all $\frac{y}{\sigma_y}$’s
with a given r and a given $\frac{x}{\sigma_x}$. It affords the opportunity of
predicting the most likely $\frac{y}{\sigma_y}$ for any $\frac{x}{\sigma_x}$, plus or minus its probable error, and
by reference to tables which express percentage of cases for given
deviations, the computation of the probability of any $\frac{y}{\sigma_y}$ for a given $\frac{x}{\sigma_x}$
and a given $r_{xy}$. These Sigma Indices (Standard scores plus a
constant) may be tabled for every age and grade as well as the
predictions of one from the other and the probable errors. We may use
commensuration without “injustice” and without “assumptions” if
we use all of our data.
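The prediction described in this paragraph can be sketched in code. The sketch is illustrative only and not part of the original article: the function names are hypothetical, and a normal (bivariate normal) distribution of standard scores is assumed so that the probable error equals 0.6745 σ and probabilities can be read from the normal curve instead of printed tables.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 * sigma under a normal curve

def predict_standard_score(x_std, r):
    """Most likely y/sigma_y for a given x/sigma_x, with its probable error."""
    prediction = r * x_std
    probable_error = PE_FACTOR * math.sqrt(1.0 - r ** 2)
    return prediction, probable_error

def prob_at_least(y_std, x_std, r):
    """P(y/sigma_y >= y_std) given x/sigma_x = x_std, assuming normality."""
    spread = math.sqrt(1.0 - r ** 2)
    z = (y_std - r * x_std) / spread
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# The question posed in the text: limits for x/sigma_x = -0.97 when r = 0.72.
center, pe = predict_standard_score(-0.97, 0.72)
print(f"prediction {center:.3f}, probable limits {center - pe:.3f} to {center + pe:.3f}")

# How likely is y/sigma_y to reach 1.6 when x/sigma_x is 1.8 (r = 0.72 assumed)?
print(f"P(y >= 1.6 | x = 1.8) = {prob_at_least(1.6, 1.8, 0.72):.3f}")
```

Half of the cases with a given $\frac{x}{\sigma_x}$ are expected to fall within one probable error of the prediction, which is what makes the probable limits a convenient statement of its reliability.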

Since the correlation of one form of a test with an infinite number of
forms is the square root of the correlation of one form with any other,¹
the σ of an individual’s standard score $\left(\frac{x}{\sigma_x}\right)$ with the truth constant,
and therefore the variability of a standard score due to errors in
measurement, is $\sqrt{1 - r_1}$ ($r_1$ is the correlation of one form of the
intelligence test with any other similar form). The σ of the $\frac{y}{\sigma_y}$ with truth
constant is $\sqrt{1 - r_2}$ ($r_2$ is the correlation of one form of the product test
with any other similar form). Kelley has given us² the σ of the
difference $\left(\frac{x}{\sigma_x} - \frac{y}{\sigma_y}\right)$ of these scores for any one individual. This is the σ
of the difference of an individual’s standard scores because it deals
with variability for constant values of truth (truth being defined as
the average of many forms of the test).


$$\sigma_d = (2 - r_1 - r_2)^{1/2}$$


A use of this o shows the reliability of differences between intelligence 
and product and gives us a clear definition of which differences are 
significant. We want to know the size of the probable errors of our 
differences and not Chapman’s $r_{d_1d_2}$, mainly because the reliability of
these differences is a function of their size and their size is a function 
of the efficiency of instruction. It is precisely when differences are 
large that we are interested and it is then that these differences are 
reliable. 
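A small sketch shows how Kelley's σ of the difference might be used to judge whether an observed difference between standard scores is large. It is not part of the original article; the function names are hypothetical, and the 0.6745 factor assumes normally distributed errors of measurement.

```python
import math

def sigma_of_difference(r1, r2):
    """Sigma of (x/sigma_x - y/sigma_y) with truth constant: sqrt(2 - r1 - r2)."""
    return math.sqrt(2.0 - r1 - r2)

def difference_in_probable_errors(x_std, y_std, r1, r2):
    """Observed difference expressed as a multiple of its probable error."""
    probable_error = 0.6745 * sigma_of_difference(r1, r2)
    return (x_std - y_std) / probable_error

# Both tests assumed to have a reliability of 0.8, as in the earlier example.
print(f"sigma of difference: {sigma_of_difference(0.8, 0.8):.3f}")
print(f"a difference of 1.0 sigma is "
      f"{difference_in_probable_errors(1.5, 0.5, 0.8, 0.8):.2f} probable errors")
```

The larger the observed difference relative to this fixed error, the more confidence it deserves, which is the sense in which the text says large differences are the reliable ones.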

Chapman’s contribution is, of course, valuable since it draws atten- 
tion to the very great danger of using differences which are unreliable 
as differences between possible and actual achievement. Great 
caution is necessary to establish the validity and reliability of the 
tests used so that we may not use tests which resemble each other too 
closely as intelligence and product tests, respectively. I entirely agree 
with Chapman that many of the tests which are today being used to 
signify differences between intelligence and product measure much the 
same quality and are different mainly because they are unreliable. 
Use of Kelley’s formula to determine the reliability of differences will 
improve our diagnoses. It is shown by $r_{d_1d_2}$ that the correlations





1 Kelley, Truman L.: “Statistical Method,” 1923. 
2 Journal of Educational Psychology, September, 1923. 
between intelligence and product must vary in size in order that the
differences between them may be useful. 


III. THE DETERMINATION OF RELIABILITY


We are all seriously concerned with the comparative reliability of 
available tests. It is important that, given the same data, we agree 
and so it is vital that we use the same statistical criteria. It is very 
much to be hoped that the time is near when at least two forms of 
every test applicable to the ages 8 to 16 inclusive, be given to one thou- 
sand unselected 12-year-olds and adequate comparative statistics 
compiled. Bulletin No. 8 of the Illinois Bureau of Research is a timely 
treatment of an aspect of this problem. Queries as to the statistical 
media upon which the judgments of this study are based need imme- 
diate answer if any agreement on reliability is to be reached. 

The contention that the ratio of $PE\sqrt{1 - r_1}$ (called $PE_M$ in the
study¹) to M is a better basis for judgment of the reliability of the
test than $r_1$ or than $\sqrt{r_1}$ is of sufficient importance to demand critical
examination. Judgments of reliability of the tests used in that study,
based on the data presented in that study, are entirely different,
dependent upon which medium we use to form our decision.

1. PE of Test 1 with values of 2 constant = $PE\sqrt{1 - r^2_{12}}$.² We
use this formula when we wish to know the error of our estimates of
Test 1 from known quantities on Test 2, since this is the spread of rows
(or columns) in a correlation scatter and hence the spread from the
regression prediction. If 1 and 2 are two forms of the same test we are
able to judge the probability of any particular score being obtained by a
particular child on Test 1 when we know the score that child has made
on Test 2, since we then know the PE of the distribution around the 
prediction. It is desirable to know the spread of scores in a test when 
the average of an infinite number of forms of the test is constant. We 
would then know the variability of scores for any one true amount and 
this is the PE of an individual’s score. 

¹ In this paper we will use $PE_M$ to designate $PE\sqrt{1 - r}$ as it was used in
the Illinois Bulletin. $r_1$ shall mean the correlation of any two forms of Test 1,
$r_2$, the correlation of any two forms of Test 2, etc.

² Yule: Introduction to the Theory of Statistics, p. 177. His formula is
used above in this paper.

2. The correlation of one form of a test with the averages of an
infinite number of forms of that test is the square root of the
correlation of one form with any other. This is expressed by the formula
$r_{1\infty} = \sqrt{r_1}$.¹

3. PE of Test 1 with true values constant = $PE\sqrt{1 - r_1}$ [substituting
(2) in (1)]. We use this formula when we wish to find the spread
of obtained scores for any one true score (a true score being defined as
the average of an infinite number of forms), knowing the score on one
form of a test and the correlation of any two forms of the test. If it is
distance which we are measuring, then the probable error of this
estimate of truth from our measurement will be expressed in the units of
space we used in our measurement. If these units are inches, then
this probable error is in inches; if they are yards, then this probable
error is in yards. It is obvious that we cannot use this formula then as
our medium of comparison of the reliability of tests since each test has
different and incommensurate units.

Therefore, in the study in question this probable error is divided by
the mean of the scores of the test: “. . . the ratio of the probable error
of measurement to the average probably furnishes the most significant
statement of the degree of unreliability.” If the zero of each test
were a real zero, then this statement could be endorsed, since the use
of this ratio would then obviate the only objection to the use of the
probable error of measurement as the medium of our judgment of
reliability. If one probable error of measurement is in inches and
another is in yards, and the average of the first measurement is in inches
and the average of the second measurement is in yards, then these
ratios will give credible results since there are here real zeros.³ The ratio
of the probable error of measurement to the mean expresses what
percentage a probable deviation with truth constant is of the distance
between the zero of a test and its mean. Its reciprocal (M divided by
$PE_M$) gives the number of $PE_M$’s between the M and zero. The
zeros of the tests used in this bulletin mean merely an accident of
choice of questions. How easy the easiest question may be is usually
an accident of the perspicacity of the author, his persuasion with
regard to testing the lesser reading abilities and his intention with regard
to the range of abilities to which the scale shall apply. The zero is then
a variable. If we make the zero of a test represent less ability than it





1 Kelley, T. L.: “Statistical Method,” 1923. 

2 Bulletin No. 8, Illinois Bureau of Research. 

³ We are assuming here that units on the scale of any one test are equal in the
sense that they represent equal increments of the ability measured. This
assumption is, of course, at the root of all statistical expression.
did before, the mean, of course, becomes higher entirely independent
of any real change in registered abilities. Granted, however, that
the abilities of a group were measured adequately by the portion of the
scale used before this change, then the standard deviation remains the
same and the $PE_M$ remains the same. Even if some of the abilities
measured are changed by the inclusion of easier elements, the mean 
changes much more than the standard deviation. Consequently, 
this ratio of probable error of measurement to the mean depends for its 
value upon the location of zero. Add some 50 easy questions to 
your test and you reduce this ratio by increasing the mean. Still, 
reliability has in no sense been increased. 

To illustrate this, suppose that seven children take a test with 10
questions and get consecutive scores from 1 to 7 out of the 10 correct.
The mean is 4 and the standard deviation is 2. Add five very easy
questions to the test, questions which each of the seven children
answers correctly, so that the scores now range consecutively from 6 to 12.
The mean is now 9 and the standard deviation is still 2. The probable
error of measurement remains the same before and after the addition
of the five easy questions. Suppose the correlation between Form 1
and Form 2 of this test to be 0.64. The probable error of measurement is
$2\sqrt{1 - .64}$ or 1.2 before and after. The ratio proposed by Monroe is
$\frac{1.2}{4}$ or 0.3 before the five questions are added to the test and $\frac{1.2}{9}$ or 0.133
after the five questions are added. Still, adding the five questions has
not changed the test’s reliability since all five new questions were
answered by all children. Radical differences in conception of the
reliability of a test judged by use of the ratio result from changes in the
location of the zero.
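The seven-child illustration can be recomputed directly. This sketch is not part of the original article; the function name is hypothetical, and, like the text's illustration, it uses σ itself rather than 0.6745 σ in the error term.

```python
import math
import statistics

def error_ratio(scores, r):
    """Return (mean, sigma, error of measurement, error / mean) for one form."""
    mean = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)       # population standard deviation
    error = sigma * math.sqrt(1.0 - r)      # sigma * sqrt(1 - r), as in the text
    return mean, sigma, error, error / mean

before = list(range(1, 8))                  # scores 1 through 7
after = [score + 5 for score in before]     # five easy questions everyone passes

for label, scores in (("before", before), ("after", after)):
    mean, sigma, error, ratio = error_ratio(scores, r=0.64)
    print(f"{label}: mean={mean:.0f}, sigma={sigma:.0f}, "
          f"error={error:.1f}, ratio={ratio:.3f}")
```

The standard deviation and the error term are unchanged by the added questions, while the ratio falls from 0.300 to 0.133 solely because the mean has moved away from zero.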

Conclusions with regard to the relative reliability of tests made in
the Illinois study are not justified because the zeros of these tests defy
comparative location. Judgments based on $\sqrt{1 - r_1}$ lead to entirely
different but better conclusions since they enable us to predict the
variability of deviations from the mean expressed in multiples of sigma
for any one true value measured.

When y represents the scores of individuals, each being the average
of an infinite number of forms of the test, and x represents the scores of
these individuals in any one form of the test, then $r_{xy}$ is equal to the
square root of the correlation between any two forms of the test. The
relative size of these r’s gives us a true index of the relative reliability
of tests. The σ of $\frac{x}{\sigma_x}$’s for any given true value of $\frac{y}{\sigma_y}$ is $\sqrt{1 - r^2_{xy}}$.
This is the best medium for comparison of the reliability of tests.
It is like the $PE_M$ except that it is a σ of standard scores instead of a σ
of units of the scale. It really avoids error due to difference of units in
different tests.

Even though the zeros of tests were all the best zeros it is possible
to obtain, the ratio of the probable error of measurement to the mean
would still not net us any better judgment of reliability than the
correlation of one form of a test with the averages of an infinite number of forms
of the same test. This is because the best zeros of human
measurements must be located by the use of σ. Dividing the probable error of
measurement by the mean is done in order to account for differences
in the units of measurement used in the different tests. We have
pointed out that this method does not account for differences in the
location of zero. Let us suppose that zero is the best zero obtainable
in every test, though the units mean different quantities, in order to
judge whether the proposed ratio would then gain any advantage.
Let us suppose a large unselected group of 12-year-olds furnish us
our data. Then the zero of each test would be about 8 PE below the
mean. Of course, frequencies do extend to infinity from the mean
but the assumption is no violent one that 8 PE below the mean is for
practical purposes zero of the ability measured. Our conclusions
would not be changed if we called this 10 PE or 20 PE as long as all
tests had their zero at the same number of σ below the mean which they
would need to have if all zeros were located in the same manner.

The average of Test 1 would equal 8 times the PE of Test 1 and the
average of Test 3 would equal 8 times the PE of Test 3, etc. The formula
which Monroe uses then reads $\frac{PE}{8\,PE}\sqrt{1 - r_1} = \frac{1}{8}\sqrt{1 - r_1}$. Of course,
we might just as well use $\sqrt{1 - r_1}$ as $\frac{1}{8}\sqrt{1 - r_1}$ since judgments about
relative reliability remain the same. The ratio of the probable error
of measurement to the mean is a product of two variables: (1) the
reliability coefficient and (2) the location of zeros. When zeros are
not properly located, we are not justified in using $\frac{PE_M}{M}$ or any of its forms.
$\sqrt{1 - r_1}$ is the proper medium for judgment of reliability; that is, the
coefficient of reliability is a satisfactory measure of reliability.

There is a further conclusion justified by this discussion. The
proposed ratio may be written $0.6745\,\frac{\sigma}{M}\sqrt{1 - r_1}$. It is the coefficient
of variability (with PE instead of σ in the numerator) times $\sqrt{1 - r_1}$.
Our criticism may therefore be extended to any use of $\frac{\sigma}{M}$ as a
measurement of comparative variability when zeros are not adequately
located. The only use that $\frac{\sigma}{M}$ has in the measurement of human abilities
is as a location of zero. It shows what proportion σ is of the distance
between zero and M. Its reciprocal, $\frac{M}{\sigma}$, shows how many σ’s below
the mean zero is. If zeros of all tests were properly located, then $\frac{\sigma}{M}$
would be the same on all tests with comparable data, because then the
zero would have been located by the use of σ. $\frac{\sigma}{M}$ would always be
about 0.20 for 12-year-olds (speaking of tests applicable to ages 8 to
16 inclusive), since 0 should be about 5σ below the mean.

In measurements where there is an absolute zero such as linear
space, variability independent of units may be expressed by this
“coefficient of variability,” but our data have no such absolute zeros.
No advantage in the measurement of variability is gained when zeros
on the tests compared are properly located, since they are then located
in terms of σ, and many erroneous interpretations result when the zeros
are accidental. The formula is, of course, limited (even when absolute
zeros independent of σ can be obtained) to a comparison of
measurements at the same level, except as a measure of how far from the mean
the zero lies. $\frac{\sigma}{M}$ used when M is 65 inches cannot be compared directly
with $\frac{\sigma}{M}$ used when M is 120 inches, and so also $\frac{\sigma}{M}$ when M is 65 inches
may not be comparable to $\frac{\sigma}{M}$ when M is 120 pounds due to their
relative removal from their zeros.
AVERY’S COMPARISON OF THE STANFORD AND
HERRING REVISIONS 


JOHN P. HERRING 
Director, Bureau of Research, 
New Jersey Department of Institutions and Agencies 


Prepared under the auspices of the Division of Education and Classification, 
William J. Ellis, Director. 


Avery (1924) has presented comparisons between the Stanford 
and the Herring Revisions of the Binet-Simon Tests. Comments on 
this comparison follow the order of presentation and the pagination of 
his article. The most important matter probably concerns mental age 
and intelligence quotient correlations between the two tests—a matter 
which appears first in the comments following. 

(P. 225) Avery’s Tables I, II, and III. The crux here is the inter- 
pretation of correlations as affected by the dispersion of a group 
measured. Kelley (1923) has presented a fundamental discussion of 
this point which ought widely to affect the publication of coefficients 
of correlation in mental and social measurements. The custom of 
interpreting r’s at their face value is, of course, passing in favor of 
more critical attitudes. 

I have not been able to obtain the standard deviations of the group 
of 48 first grade children studied by Avery. Mr. G. M. Willson, of the 
New Jersey Department of Institutions and Agencies, possesses unpub- 
lished summaries made from published and unpublished sources, of 
the standard deviations of mental ages of 72 grade groups in which n 
ranges from 17 to more than 48,000 and in which no grade group below 
the fourth (of which there are 21) has a sigma greater than 13 or less 
than four mental months. 

The 21 grade sigmas are here tabulated: 


TABLE I. STANDARD DEVIATIONS OF MENTAL MONTHS IN VARIOUS GROUPS

Grade ........................   K          I          II         III
                               6 (41)     4 (29)     4 (28)     6 (23)
                                          7 (27)     5 (26)     8 (27)
                                          9 (30)     5 (53)     8 (24)
                                                     6 (21)    10 (31)
                                                     7 (21)
                                                     7 (24)
                                                     7 (28)
                                                     7 (50)
                                                     8 (29)
                                                     9 (25)
                                                    10 (26)
                                                    10 (50)
                                                    13 (35)
Average standard deviation ..   6.0        6.7        7.5        8.0
A kindergarten of 41 children had a standard deviation of six mental
months; a first grade of 29 children had a standard deviation of four
mental months, et cetera. The tests employed were Stanford-Binet,
National Intelligence Tests and others. The weighted average standard
deviation for these four grades is 7.4 mental months. Since group
tests were sometimes used, this weighted average is probably too high
rather than too low.
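
As a rough check on the averages quoted from Table I, the following sketch recomputes them. The assignment of each entry to a grade column is a reconstruction from the printed layout, and weighting each sigma by the number of children is an assumption about how the weighted average was formed. Under those assumptions the per-grade averages agree with the table, and the weighted figure comes to about 7.47, close to the 7.4 reported.

```python
# (sigma in mental months, number of children) for each grade group,
# as read from Table I; the column assignment is a reconstruction.
groups = {
    "K": [(6, 41)],
    "I": [(4, 29), (7, 27), (9, 30)],
    "II": [(4, 28), (5, 26), (5, 53), (6, 21), (7, 21), (7, 24), (7, 28),
           (7, 50), (8, 29), (9, 25), (10, 26), (10, 50), (13, 35)],
    "III": [(6, 23), (8, 27), (8, 24), (10, 31)],
}

def simple_mean(rows):
    """Unweighted mean of the sigmas in one grade column."""
    return sum(sigma for sigma, _ in rows) / len(rows)

def weighted_mean(rows):
    """Mean of the sigmas weighted by group size (an assumed weighting)."""
    return sum(sigma * n for sigma, n in rows) / sum(n for _, n in rows)

per_grade = {grade: round(simple_mean(rows), 1) for grade, rows in groups.items()}
all_rows = [row for rows in groups.values() for row in rows]
overall = weighted_mean(all_rows)

print(per_grade)
print(round(overall, 2))
```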

It is likely that the 48 Palo Alto, California, children are a somewhat
selected group, having a dispersion of mental ages less than average
for first grades. An element of uncertainty is present in the
probability that the 48 children were those of two classes. It is likely
that the standard deviations of the group lay between 4 and 16 mental
months and between .03 and .12 IQ. Accordingly, Tables II and III
below estimate what r’s Avery would probably have found if he had
tested with the same reliability in first grades having the range typical
of unselected 12-year-old groups (about 25 mental months) instead
of that of grade groups. For example, he reported (1924) a correlation
of .67 between mental ages of Group A (Herring Revision) and of the
Stanford Revision. If his sigma was four mental months, then with
equally reliable testing in a range represented by a standard¹ sigma,
i.e., a sigma of 25 mental months, he would have found a correlation
of .99; if 8, then .97; if 12, then .92; if 16, then .86 (Kelley’s Formula
178, 1923). Herring (1924) finds a correlation of .88 between Group A
of the Herring Revision and the 6-test Stanford Revision in a group
in which the sigma was 35 Stanford months. The corresponding
standard correlation is .77 for which k is .64.

¹ A standard r is an r obtained in a standard group. A standard group is
one having a standard sigma. A standard sigma is equal to that of unselected
12-year-olds, which I estimate for the present at 25 mental months, or educational
months, or the like.


TABLE II. MENTAL AGE CORRELATIONS

Test              r      Assumed σ      1 − (1 − r)σ²/Σ²
Group A          .67         4               .99
                             8               .97
                            12               .92
                            16               .86
[illegible]      .79         4               .99
                             8               .98
                            12               .95
                            16               .91
[illegible]      .82         4              1.00
                             8               .98
                            12               .96
                            16               .93


TABLE III. IQ CORRELATIONS

Test              r      Assumed σ      1 − (1 − r)σ²/Σ²
[illegible]      .73         3               .99
                             6               .96
                             9               .90
                            12               .83
[illegible]      .77         3               .99
                             6               .96
                             9               .92
                            12               .85
[illegible]      .80         3               .99
                             6               .97
                             9               .93
                            12               .87
[illegible]      .88         3              1.00
                             6               .98
                             9               .96
                            12               .92
[illegible]      .95         3              1.00
                             6               .99
                             9               .98
                            12               .97
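
The adjustment behind Tables II and III can be sketched as follows. The sketch assumes that the relation used is σ/Σ = √(1 − R)/√(1 − r), solved for R, with Σ = 25 mental months for mental ages and 15 IQ points (.15 IQ) for IQ's. The function name `standard_r` is illustrative and does not come from the article.

```python
def standard_r(r, sigma, standard_sigma):
    """Solve sigma/Sigma = sqrt(1 - R)/sqrt(1 - r) for R.

    r              correlation obtained in the restricted group
    sigma          standard deviation assumed for that group
    standard_sigma standard deviation of the standard group
    """
    return 1.0 - (1.0 - r) * (sigma / standard_sigma) ** 2

MENTAL_AGE_SIGMA = 25.0  # mental months, the article's estimate
IQ_SIGMA = 15.0          # IQ points, i.e. .15 IQ

# First test of Table II (r = .67) and first test of Table III (r = .73).
mental_age_row = [round(standard_r(0.67, s, MENTAL_AGE_SIGMA), 2) for s in (4, 8, 12, 16)]
iq_row = [round(standard_r(0.73, s, IQ_SIGMA), 2) for s in (3, 6, 9, 12)]

print(mental_age_row)
print(iq_row)
```

The same function lowers a coefficient when the obtained sigma exceeds the standard one, as in the case of a sigma of 35 Stanford months.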


E. J. Bragaw administered the Stanford Revision and the Herring
Revision in that order to 40 kindergarten children with the following
results:


Standard Deviation of Mental Ages (Stanford) in months .......... 6.6
Pearson Product Moment Correlation between mental ages .......... .84
Pearson Product Moment Correlation corrected by means of
  formula σ/Σ = √(1 − R)/√(1 − r), in which σ = 6.6 months,
  Σ = 25 months ................................................. .989
Coefficient of Alienation (k) corresponding to R = .989
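
The value of k for Bragaw's corrected coefficient is not legible in the text as it stands. A minimal sketch, assuming the usual definition of the coefficient of alienation, k = √(1 − r²), reproduces the k values quoted elsewhere in this article and gives about .15 for R = .989. The function names are illustrative.

```python
import math

def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)

def standard_r(r, sigma, standard_sigma=25.0):
    """Correction for dispersion: sigma/Sigma = sqrt(1 - R)/sqrt(1 - r)."""
    return 1.0 - (1.0 - r) * (sigma / standard_sigma) ** 2

# Bragaw's kindergarten group: r = .84 with a sigma of 6.6 mental months.
bragaw_R = standard_r(0.84, 6.6)
bragaw_k = alienation(round(bragaw_R, 3))

print(round(bragaw_R, 3), round(bragaw_k, 3))
print(round(alienation(0.77), 2), round(alienation(0.76), 2))
```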


Such comparisons between the Stanford and Herring Revisions of 
the Binet-Simon Tests as those of Avery and of Bragaw, in which the
authors of the tests have no part, are in order. A study is needed of
the typical relations between three classes of reliability coefficients:


Class 1. r’s obtained by an author of a test.
Class 2. r’s obtained by persons whom an author has trained.
Class 3. r’s obtained by persons not trained by an author.


It may be expected that, other things equal, such coefficients will 
vary directly with closeness of association between author and
examiners, class 3 averaging lowest. Bragaw’s r = .989 is, however, a
class 3 r. Since r’s of class 3 will usually be more frequent than of 2 or of 1,
they are an important group. The reporting and analysis of these r’s 
from the field has a significance quite differentiable from the others. 
An author, assumed to possess the most complete knowledge of 
standardization of procedure and scoring, sets and reports a standard 
of reliability to be reached by others if and when they can. In the case
of the correlation between the Stanford and Herring Revisions, tech- 
nique and criteria for approximating the reliability reported by Herring 
have been suggested elsewhere (Herring 1924) and are here amplified: 

1. Give the Stanford first to one examinee, following with the 
Herring on another day, within at most two weeks. 

2. Reread the test manuals to discover errors of procedure. If the 
difference in mental ages is greater than four months, discover, if 
possible, the reason for the difference. 

3. Continue thus, always comparing procedure with directions at 
frequent intervals, until an average difference not greater than four 
mental months can be maintained. 

4. After a group of say 20 to 30 have been tested, make a scatter
diagram and compute r (Stanford, Herring). Use formula
σ/Σ = √(1 − R)/√(1 − r) for rendering the obtained coefficient
comparable with standard r’s, unless the group has the standard sigma,
i.e., 25 mental months, a situation in which r, directly comparable
with other standard r’s, needs no correction for dispersion.
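
The checks in steps 3 and 4 can be sketched as follows. The paired mental ages are hypothetical and are not data from the article. The function names, the use of the Stanford scores to estimate the group sigma, and the population form of the standard deviation are all assumptions made for illustration.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def standard_r(r, sigma, standard_sigma=25.0):
    """Correction for dispersion: sigma/Sigma = sqrt(1 - R)/sqrt(1 - r)."""
    return 1.0 - (1.0 - r) * (sigma / standard_sigma) ** 2

# Hypothetical mental ages in months for the same ten children.
stanford = [62, 66, 70, 71, 74, 77, 80, 83, 85, 90]
herring = [65, 64, 73, 70, 78, 75, 83, 80, 88, 91]

# Step 3: the average difference should not exceed four mental months.
mean_abs_diff = statistics.fmean([abs(s - h) for s, h in zip(stanford, herring)])

# Step 4: obtained r, then r referred to the standard sigma of 25 months.
r = pearson_r(stanford, herring)
sigma = statistics.pstdev(stanford)
R = r if math.isclose(sigma, 25.0) else standard_r(r, sigma)

print(mean_abs_diff, round(r, 3), round(sigma, 1), round(R, 3))
```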

Briefer comments upon other features of Avery’s article follow: 

(P. 225) “These correlations are, of course, fairly good.” The 
correlations of reference average .76, for which k is .65. They, there- 
fore, appear, if taken at face value, to be fairly bad, especially for 
individual testing. They should not, of course, be taken at face. 
(P. 225) “They indicate that the Stanford and the Herring Tests
are far from being equivalent.” A better criterion of identity of two 
tests is an r of unity when corrected for attenuation. 

(P. 225) “Group C has a higher correlation than D and E.” Yet
r_C − r_D and r_C − r_E are both less than the probable errors of any of the
three r’s!
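
The comparison with probable errors can be illustrated with the classical formula PE = .6745(1 − r²)/√N. The three coefficients below are hypothetical stand-ins, since the values for Groups C, D and E are not restated here. N = 48 is the size of Avery's group.

```python
import math

def probable_error_r(r, n):
    """Classical probable error of a correlation coefficient."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

n = 48
r_c, r_d, r_e = 0.82, 0.80, 0.79  # hypothetical values, for illustration only

probable_errors = [probable_error_r(r, n) for r in (r_c, r_d, r_e)]
differences = [r_c - r_d, r_c - r_e]

# A difference smaller than every probable error gives no ground for
# ranking one group above another.
negligible = all(d < min(probable_errors) for d in differences)
print([round(pe, 3) for pe in probable_errors], negligible)
```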

(P. 225) “The results corroborate those of the IQ’s.” Because of
the relationship of dependence of the IQ’s upon the MA’s, neither set 
of correlations may be taken as independent corroboration of the other. 

(P. 226) “Group A was too high when compared with the other
Herring Forms.” “The Herring Tests tend to grade too high the child
of a mental age of less than six.” I have unpublished data confirming
both these statements.

(P. 226) “The pictures . . . not as well drawn as the Stanford.”
A suitable criterion for excellence of drawing might involve coefficients 
of partial correlation showing close relation between quality of drawing 
and test validity, other things equal. 

“This is, of course, at the expense of standardization.” Since
standardization as described by Herring (1924) was accomplished 
entirely independently of data for the mental level of particular tests, 
the statement seems to me to be in error. 

“In 13 cases there was a variation of only two points and in 22 cases
. . . of only three points or less.” I take “two points” to mean “two
points or less.” This statement suggests that the work of the examiners
was more reliable than their own interpretation of their obtained r’s
indicates. Given the fact that 13/48ths of the IQ differences were
.02 or less, assuming (somewhat questionably) a normal distribution of
differences and reading from Pearson’s Table II (1914), the standard
deviation of the differences is estimated as .0571 IQ and by Kelley’s
Formula 178 (Kelley, 1923) the correlation between the Herring and
the Stanford is estimated as .93, the corresponding k being .37. In the
same manner given 22/48ths of the IQ differences as .03 IQ or less, the
standard deviation of the differences is estimated as .0492, the standard
deviation of the IQ’s of the group as .0770 and the standard correlation
between the Herring and the Stanford as .95, for which k is .31.
(0.15 IQ is taken as the typical standard deviation of a 12-year-old
age group.)
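
The two estimates can be approximated with a short sketch. It assumes, as the text does, that the differences are normally distributed about zero. It further assumes that the two tests have equal sigmas, so that the variance of the differences is 2σ²(1 − r), and refers the result directly to the standard sigma of .15 IQ. That relation is an assumption of this sketch rather than a quotation of Kelley's formula. The inverse normal from Python's standard library stands in for Pearson's table, which accounts for small differences from the printed standard deviations.

```python
from statistics import NormalDist

STANDARD_SIGMA_IQ = 0.15  # typical sigma of a 12-year-old age group

def sd_of_differences(limit, share_within):
    """Sigma of differences such that `share_within` of them fall within +/- limit."""
    z = NormalDist().inv_cdf(0.5 + share_within / 2.0)
    return limit / z

def standard_r_from_differences(sd_diff, standard_sigma=STANDARD_SIGMA_IQ):
    """R at the standard sigma, assuming var(d) = 2 * sigma^2 * (1 - R)."""
    return 1.0 - sd_diff ** 2 / (2.0 * standard_sigma ** 2)

sd_two_points = sd_of_differences(0.02, 13 / 48)
sd_three_points = sd_of_differences(0.03, 22 / 48)

R_two = standard_r_from_differences(sd_two_points)
R_three = standard_r_from_differences(sd_three_points)

print(round(sd_two_points, 4), round(R_two, 2))
print(round(sd_three_points, 4), round(R_three, 2))
```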

These estimates of .93 and .95, based as they are upon a question- 
able assumption, illustrate the difficulties to which readers are put 
when needed standard deviations are not reported. 
(P. 226) “The Herring is perhaps insufficiently standardized.”
Data are being collected for restandardization. 
FATIGUE AND WORK CURVE FROM A 10-HOUR
DAY IN ADDITION 


H. B. REED 
Grinnell College 


Eight subjects, seven college students and the writer, added five
2-figure problems from 7:30 in the morning to 5:30 in the afternoon,
stopping 35 minutes at 12:15 for lunch. Time was called every minute
for the first 15 minutes, and every 10 minutes thereafter until the last 
10 minutes when it was again called every minute. The subjects 
marked their papers at every call. After the work was completed the 
number of examples attempted and the number of examples wrong for 
each period was calculated. 

The examples contained no zeros. Thirty-five were printed on a 
sheet. Each student was supplied with 100 sheets for the day’s work. 
Sixty-five different examples were used, and these recurred in a regu- 
lar order. The number should have been larger, but as indicated by 
the errors, and by the testimony of the subjects almost none were 
memorized during the day. To separate the influence of practice 
from fatigue all the subjects did 10 minutes of addition with the same 
problems on the morning after rest from the previous day of addition. 


The amount of fatigue depends entirely on how it is measured.
But Tables I and II give the essential facts from which fatigue may be
calculated. Table I gives the totals right and wrong for each hour of
the day. Table II gives the totals for the following 10-minute periods:
the first, the last in the forenoon, the last in the afternoon, and the
first on the morning after the rest.


TABLE I. TOTAL EXAMPLES RIGHT AND WRONG PER HOUR

Hour          1     2     3     4     5     6     7     8     9    10¹
Right .....  2410  2260  2433  2105  2186  2167  2306  2361  2277  2259
Wrong .....   203   177   176   188   154   210   236   260   285   278

¹ Estimated on the basis that the last 20 minutes for each of these hours would
equal the work of the last 20 minutes of the preceding hours.


TABLE II

              First 10       Last 10        Last 10        First 10
              minutes in     minutes in     minutes in     minutes in
              the morning    the morning    the afternoon  the morning
                                                           after rest
Right .....      418            399            424            503
Wrong .....       28             21             66             48
The amount of fatigue might presumably be measured by any of the
following ways, and if so, the amount is indicated opposite each measure.


 1. Per cent of decrease in examples right from first hour to last hour ......  6.2
 2. Per cent of decrease in accuracy from first hour to last hour ........ [illegible]
 3. Per cent of increase in errors from first hour to last hour .......... [illegible]
 4. Per cent of decrease in right minus wrong from first hour to last hour ... 10.3
 5. Per cent of increase in examples right from first 10 minutes to last 10
    minutes ..................................................................  1.4
 6. Per cent of decrease in accuracy from first 10 minutes to last 10 minutes   7.4
 7. Per cent of increase in errors from first 10 minutes to last 10 minutes .. 71.4
 8. Per cent of increase in examples right from last 10 minutes before rest to
    first 10 minutes after rest .............................................. 18.6
10. Per cent of decrease in right minus wrong from first 10 minutes to last
    10 minutes ...............................................................  8.2
11. Per cent of increase in right minus wrong from last 10 minutes before rest
    to first 10 minutes after rest ........................................... 21.2
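
How much the result depends on the measure chosen can be seen by recomputing a few of these figures from Table II. The sketch below assumes each percentage is taken on the earlier of the two periods compared. That is a guess at the method, and it reproduces the printed values for measures 5, 8 and 10 only, so the other figures in the list may rest on a different base. The dictionary keys and function names are illustrative.

```python
# Totals for all eight subjects in each 10-minute period, from Table II.
periods = {
    "first": {"right": 418, "wrong": 28},
    "last_forenoon": {"right": 399, "wrong": 21},
    "last_afternoon": {"right": 424, "wrong": 66},
    "after_rest": {"right": 503, "wrong": 48},
}

def pct_change(old, new):
    """Percentage change from old to new, taken on the old value."""
    return 100.0 * (new - old) / old

def net(period):
    """Examples right minus examples wrong."""
    return period["right"] - period["wrong"]

first = periods["first"]
last = periods["last_afternoon"]
rested = periods["after_rest"]

gain_right = pct_change(first["right"], last["right"])        # measure 5
rest_gain_right = pct_change(last["right"], rested["right"])  # measure 8
loss_net = -pct_change(net(first), net(last))                 # measure 10

print(round(gain_right, 1), round(rest_gain_right, 1), round(loss_net, 1))
```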


It is seen that the amount of fatigue varies all the way from −1.4
per cent to +71.4 per cent, so that experimental evidence could be
found for almost any conclusion which a writer wished to draw.
Besides these a number of mental and physiological tests might have been
given before and after work, and a dozen more disagreeing measures
found in the same way. We need a reliable fatigue measure.


TABLE III. NUMBER OF EXAMPLES RIGHT AND WRONG FOR EACH 10 MINUTES

                    First Hour                     Second Hour
            1    2    3    4    5    6      1    2    3    4    5    6
Right ...  418  406  385  369  450  382    417  345  409  393  315  383
Wrong ...  [illegible]                     [illegible]

                    Third Hour                     Fourth Hour
            1    2    3    4    5    6      1    2    3    4    5    6
Right ...  442  416  385  409  429  352    356  355  359  375  338  322
Wrong ...  [illegible]                     [illegible]

                    Fifth Hour                     Sixth Hour
            1    2    3    4                1    2    3    4    5    6
Right ...  358  376  403  399              361  396  342  380  343  345
Wrong ...  [illegible]                     [illegible]

                    Seventh Hour                   Eighth Hour
            1    2    3    4    5    6      1    2    3    4    5    6
Right ...  409  364  401  395  370  367    419  418  389  366  382  387
Wrong ...   34   31   41   49   38   43     39   36   45   47   43   50

                    Ninth Hour                     Tenth Hour
            1    2    3    4    5    6      1    2    3    4    5    6
Right ...  408  364  397  394  350  364    378  374  369  424
Wrong ...  [illegible]                     [illegible]


TABLE IV

Minute                              1   2   3   4   5   6   7   8   9  10
R first 10 minutes ............... 45  37  36  43  34  36  38  40  37  38
R last 10 minutes ................ 42  34  38  33  45  34  37  39  35  42
R first 10 minutes after rest .... 42  45  49  40  48  51  44  46  47  44


Arguments might be given for each of the above measures. The
equivalent of each of them has been used by one or another psychologist
in some study of fatigue, and so we could quote authority
for all of them. To judge the amount of fatigue alone from the
decrease in the quantity of the work done correctly per unit of time,
is to assume that the examples done incorrectly are unimportant. To
judge the amount of fatigue alone from the increase in the errors per
unit of time is to ignore the amount of work done correctly. To base
it on the decrease in the per cent of accuracy ignores the quantity of
work per unit of time. To base it on the difference in any of these
measures between the first and last 10 minutes of work may not be
fair as efficiency of these periods is above the average. The same
objection applies to the 10-minute periods before and after rest.
To use hour periods instead of 10-minute ones may be too long. But
evidently we must use something. To the writer the increase in the
right minus wrong from the 10-minute before rest to the 10-minute
after rest appears the best one of the above measures. This penalizes
for error and considers the fact that work done wrongly makes extra
work. It also takes account of the work done correctly and separates
the influence of practice from the influence of fatigue. According to
this measure a 10-hour day produced a fatigue effect of 21.2 per cent 
or 4. 

Coming now to the Kraepelin characteristics of the work curve,
namely initial spurt, end spurt, adaptation, practice effect, and fatigue,
it will be recalled that Thorndike denies the first three. However,
Chapman claims to have good evidence for initial spurt, when the work
is calculated for short intervals such as two minutes. To facilitate the
analysis of these elements, we calculated the number of examples right
and the number wrong for each 10 minutes of the day, after the first
five minutes (see Table III). We also calculated these results by minutes
for the first 10 minutes, the last 10 minutes, and the first 10 minutes
after rest (see Table IV).

Studying the work done by 10-minute periods in Table III for the
Kraepelin characteristics of the work curve we find none of them visibly
present except fatigue and that only in a slight degree. The number of
examples done during the first and last 10 minutes of the day are high
points in the curve but the first is exceeded by four others and the last
by three other 10-minute periods during the day.
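
The two counts can be checked against the "Right" rows of Table III. In the sketch below the fifth and tenth hours carry four periods each, as tabulated. Reading "four others" as excluding the final period itself is an interpretation, not something the text states.

```python
# Examples right in each 10-minute period, read from Table III.
right_by_period = [
    418, 406, 385, 369, 450, 382,  # first hour
    417, 345, 409, 393, 315, 383,  # second hour
    442, 416, 385, 409, 429, 352,  # third hour
    356, 355, 359, 375, 338, 322,  # fourth hour
    358, 376, 403, 399,            # fifth hour (four periods tabulated)
    361, 396, 342, 380, 343, 345,  # sixth hour
    409, 364, 401, 395, 370, 367,  # seventh hour
    419, 418, 389, 366, 382, 387,  # eighth hour
    408, 364, 397, 394, 350, 364,  # ninth hour
    378, 374, 369, 424,            # tenth hour (four periods tabulated)
]

first, last = right_by_period[0], right_by_period[-1]
middle = right_by_period[1:-1]  # every period except the first and the last

above_first = [x for x in middle if x > first]
above_last = [x for x in middle if x > last]

print(above_first)
print(above_last)
```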

We also notice that the work done during the first 10 minutes of 
each hour is higher than that of either the immediately preceding or 
succeeding periods. The call to work, and knowledge that the work 
will soon be finished seem to have stimulating effects, but there seems 
to be no reason for giving any of these high points special names. 

If we study the results by minutes, as in Table IV, we notice that
the number of examples done during the first minute is greater than
that of any other minute during the first 10. Here is evidence for
initial spurt, but it is dispelled as soon as we look at the results for the
10 minutes after rest. The number done during the last minute of
the 10-hour day looks like end spurt, until we look at the number done
during the fifth and tenth minute before the last, when we see that
those points equal or exceed the last. Our results therefore offer no
evidence either for initial or for end spurt regardless of whether the
intervals studied are one hour, 10 minutes, or are one minute long.
Neither is adaptation visibly present, and practice effect appears only
when we compare the work done during the 10-minute periods before
and after rest. That there was not more practice effect was due no
doubt to the fact that all of the subjects worked one hour of these
examples in connection with a practice experiment two weeks before
the fatigue experiment.
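
The minute-by-minute argument can be restated as a pair of simple tests on the rows of Table IV. The function names are illustrative, and the label of the second row (last 10 minutes) is inferred from the text.

```python
# Examples right per minute, read from Table IV.
first_10 = [45, 37, 36, 43, 34, 36, 38, 40, 37, 38]
last_10 = [42, 34, 38, 33, 45, 34, 37, 39, 35, 42]
after_rest = [42, 45, 49, 40, 48, 51, 44, 46, 47, 44]

def opens_with_peak(minutes):
    """True if the first minute exceeds every later minute (initial spurt)."""
    return minutes[0] > max(minutes[1:])

def closes_with_peak(minutes):
    """True if the last minute exceeds every earlier minute (end spurt)."""
    return minutes[-1] > max(minutes[:-1])

print(opens_with_peak(first_10))    # spurt-like start on the first morning
print(opens_with_peak(after_rest))  # not repeated after rest
print(closes_with_peak(last_10))    # earlier minutes equal or exceed the last
```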

The appearance of initial spurt discovered by Chapman is no doubt
due to the fact that his subjects worked only from 10 to 16
minutes. If they had worked one hour or 10 hours he would have
probably found many two-minute periods just as productive as the
first.
A COMMUNICATION AND A REPLY


1. A CHALLENGE STILL OPEN 


Numerous investigations have been published by Professor 
Thorndike and his school claiming to disprove the theory concerning 
intelligence known as that of Two Factors. For such authoritative 
work every psychologist is bound to have the highest respect. And 
for my own part, I yield to no one in admiration for Thorndike’s 
contributions to our science. 

Nevertheless, the fact remains that upon every occasion, whilst 
professing to follow the criteria laid down by me as to the validity of 
this theory, the said investigations have not really done so at all. 
The last time that this happened, I endeavoured to prevent further 
such mishaps by venturing upon the following challenge. 

“Let Thorndike and myself agree as to what (if any?) ascertainable
fact would be in harmony with the view of the one of us and not the
other. Let us agree upon and carry out the same procedure of
investigation. And then let us produce and explain our respective results
side by side. If a third and impartial investigator can be found, so
much the better.”¹

Instead of his taking this up, however, a new research has now been
brought from his laboratory by Gates and La Salle and published in
this Journal, which once more professes to disprove the theory by
means of my own criterion.² But really, these authors employ the very
same criterion which I most explicitly disclaimed when issuing the
challenge. Surely, this pushes misunderstanding to the verge of
misrepresenting! Moreover, the genuine criteria have been set forth
quite plainly; and they have never in the slightest degree varied.³
Nor, to the best of my belief, has any other serious criterion, or even
modification of one, been proposed by anybody else.

I therefore now, in all earnestness and friendliness, renew the 
challenge. On its being accepted, Thorndike and I would, in place of 
the present trifling with one another, march hand in hand towards 
our common goal, the truth. 


C. SPEARMAN. 





¹ Psychological Review, 1922.

² Vol. XIV, 1923, pp. 517-539.

³ British Journal of Psychology, Vol. V, 1912, pp. 53-60.
I am ashamed to have neglected Professor Spearman’s friendly
challenge for so long. My only excuse is that I have been exceedingly 
busy with certain investigations which had to be completed within 
specified times. 

It seems to me wise to increase our knowledge concerning the organi- 
zation of intellect, and the number and nature and respective contri- 
butions of whatever factors contribute to intellect, along three main 
lines. To make the discussion clearer for those who have not full 
acquaintance with the problem, I shall use an illustrative case in 
describing these. 

Consider that we have 10,000 16-year-olds measured in respect 
of a dozen abilities, such as (1) discrimination of pitch, (2) memory of 
digits, (3) giving the opposites of words, (4) supplying omitted words in 
sentences, (5) defining words, (6) supplying omitted numbers in series 
(such as 1,2... 8,16... ,or 5,1,13% .. .), (7) solving 
arithmetical problems, and (8) completing pictures which are chosen at 
random from intellectual abilities. The problem in which Professor 
Spearman and I are interested, concerns the explanation of their inter- 
correlations, which will all be positive. 

The first and most obvious line of attack seems to me to be to obtain 
fairly precise measures of these abilities, say such that the measure 
would have a mean square chance error not over one tenth of the mean 
square variation of the group of individuals used, say a random 
thousand of 16-year-olds. This would mean that the self-corre- 
lations in such a group would be .95 or higher. 

The intercorrelations (after correction for attenuation) of the scores 
in the dozen abilities and their partial correlations with some reasonable 
criterion score of intellect as a whole which might be built up by combin- 
ing the scores of each individual in the dozen abilities, would obviously 
be instructive. The meaning of similar past experiments is somewhat 
obscured and complicated by the lack of precision in the tests, which 
often required very extensive correction for attenuation. Professor 
Spearman did us a great service in 1904 by showing the need for this 
correction, and convenient means of making it. If, however, the 
denominators in the correction formulas can be kept up to .95 or 
higher, all the subsequent consideration of the correlations and the 
partial correlations will be freed from uncertainties about the nature 
and effects of the errors of measurement. 
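
The point about keeping the denominators high can be illustrated with Spearman's correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliability coefficients. The raw coefficient of .60 below is hypothetical. With reliabilities of .95 the correction changes it by about 5 per cent, whereas with crude tests the correction dominates the result.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Spearman's correction: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)

raw = 0.60  # hypothetical observed intercorrelation

precise = correct_for_attenuation(raw, 0.95, 0.95)  # precise tests
crude = correct_for_attenuation(raw, 0.70, 0.70)    # crude tests

print(round(precise, 3), round(crude, 3))
```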

A second useful line of attack seems to be to secure correlational 
data from groups so large that none of the correlations need be ruled 
out when the technique of arrangement in a hierarchy and of correla-
tion of the correlations is applied to them. 

A third useful line of attack seems to be to operate the hierarchy 
and correlation of the columns technique with coefficients corrected 
for attenuation rather than with ‘‘raw”’ coefficients. 

These suggestions all involve heavy labor, and may be neither 
Professor Spearman nor I nor anyone else will find the time to carry 
them out soon. We may have to content ourselves for the next 20 
years, as for the past 20, by working with crude measurements, 
adding more evidence here and new methods of treatment there, gradu- 
ally improving our knowledge of how mental abilities are interrelated. 

Dr. T. L. Kelley and Mr. Bailor are now doing what I am sure will 
be valuable work, on the general problem, though neither of them is 
working in just the ways which I have suggested. It is probable that 
Professor Spearman will find more useful ways of continuing his work 
than by following any one of my suggestions. Indeed, I most heartily 
pray him to do what he thinks best, regardless of them. 


E. L. THORNDIKE. 
CONDUCTED BY LAURA ZIRBES
MORE MEASUREMENT


Measuring Results in Education, by Marion Rex Trabue. American 
Book Company, 1924. Pp. 492. 


This book represents Professor Trabue’s idea of what elementary 
teachers need to know about measurement. It emphasizes principles 
rather than the listing of tests and scales, or the detailed outlining of 
procedure; but it develops and makes clear these principles through the 
description of selected type tests and scales and through accounts of 
their practical use. 

The student who masters this text will have some knowledge of the 
history of the measurement movement; an acquaintance with a num- 
ber of the best tests and scales in spelling, handwriting, arithmetic, 
reading, and composition; a possibly more than adequate compre- 
hension of elementary statistical procedure in connection with the 
treatment of scores; and a set of excellent criteria for judging measure- 
ment devices and measurement practice in general. There are enough 
special exercises and references to lead him on as deeply as he may care 
to go. 

Granting that Professor Trabue’s treatment of the subject is 
carefully directed and properly restrained, that his language is straight- 
forward, that he has very cleverly interwoven the practical and the 
theoretical, and that he is safe and sound in his judgments, one still 
may feel that an elementary school teacher could be more profitably 
instructed by material dealing with the diagnostic features of measure- 
ment and with the meeting of the individual needs revealed by tests. 
That some such idea lurks in the mind of the author himself is
indicated by one of his concluding statements. He says: “Teachers will
undoubtedly make more and more use of exactly measured facts
regarding their pupils, although it is by no means certain that they
will themselves become specialists in the administration of all these
tests.” Like most of its predecessors this volume seems more adapted
to the needs of the measurement specialist than to those of the class- 
room teacher. 


M. H. WILLING.





A SCOTCH PSYCHOLOGY OF EDUCATION


The Psychology of Education, by David Kennedy-Fraser. Boni and 
Liveright, New York, 1924. Pp. VIII + 232. 


This short and very readable book has been written by a lecturer at 
Edinburgh University and Training Center. Presumably it is 
intended as an elementary text for Scottish teachers in training. 
Its material, however, is not to be distinguished, in the main, from that 
of American texts which concern themselves with showing the
application of general psychology to school teaching. There are brief chapters
on heredity, instincts, dispositions and interests, sensation and image, 
attention, perception, imagination, memory, and reaction. Some 
space is given to topics more narrowly educational, such as intelligence 
and its measurement, perceptual learning, the laws of learning, the 
results of the learning process, the thought process and school discipline. 
The sources for the most part are American, and many of the illus- 
trations are from American school experiences. 

This book differs from its American prototypes in being unprovided 
with reference lists, special exercises, graphical illustrations, or any 
of the headlining devices that are supposed to steer the harassed 
student toward learning satisfactions in this field. It is sufficiently 
unscholastic in appearance and style to fool the lay reader, or to tempt 
the teacher who dislikes courses. 

Altogether this is a creditable handling of matters which all teachers 
should know; but it does not by any means include all they should 
know, or, possibly, all that today is relatively most important for 
them to know. 


M. H. WILLING. 
PROBLEMS IN EDUCATION


Problems in the Administration of a School System, by J. B. Edmonson 
and Erwin E. Lewis. 

Problems of the High-school Teacher, by J. B. Edmonson and Raleigh 
Schorling. 

Problems of the Rural Teacher, by Marvin S. Pittman. The Public 
School Publishing Company, Bloomington, Illinois, 1924. 


These three sets of problems for use in education courses are 
Numbers 4, 5, and 6 respectively of Professor G. M. Whipple’s Educa- 
tional Problems Series. The first three of the series—Whipple’s 
Problems in Educational Psychology, Woody’s Problems in Elementary 
School Instruction, and Edmonson’s Problems in Secondary Education 
—dealt primarily with the theoretical basis and the technique of 
instruction. These last three deal mainly with school situations of the 
administrative, the routine management, the professionally ethical, and 
the community relationship types. Like the first three they are 
attempts “to bridge the gap between theory and practice.” 

Problems in the Administration of a School System is pretty much a 
harrowing collection of the bristling difficulties, the embarrassments, 
the annoyances, and the downright injustices encountered by the 
average school superintendent in a town or small city. It is a fine and 
realistic assortment for the student who has had experience in the 
superintendency, but a very discouraging one for the novice. The 
book needs a supplement playing up some of the situations where a 
superintendent is more nearly the master of his own fate. 

Problems of the High-school Teacher is organized under such headings 
as aims, the teacher, the pupils, discipline, extra-curricular activities, 
school morale, classroom technique, training in habits of study, student 
rating, salaries, and professional duties and responsibilities. It seems 
to cover the field, outside actual instruction, pretty thoroughly. The 
statements of the problems and the listings of references are such as to 
stimulate wide reading. In fact, study and reading rather than dis- 
cussion or the interchange of experiences, are obviously relied upon to 
furnish the answers to most of the questions. The book could very 
safely be used as the basis or outline of a well rounded course in secon- 
dary teaching. 

Problems of the Rural Teacher is the most cheerful of the three sets. 
The situations presented here are those that confront three capable 
beginning teachers in their first year’s work in three different kinds of 
rural schools. What becomes more apparent than anything else,
perhaps, as these three teachers meet and solve their problems, is 
that success comes by various routes. The solution of many problems 
here hangs upon the nature of the community, the type of pupils, and 
the temperament of the teacher. Emphasis is upon an all-around view 
of the factors in the situation, rather than upon principles. As a 
consequence the organization of the problems is not very obvious, and 
it may be that many of the problems themselves are too specific. 
However, the compilation is a very interesting one, and its spirit 
should prove decidedly exhilarating to young rural school teachers. 

These problem pamphlets will undoubtedly be widely and prof- 
itably used. The material is the kind that provokes thinking, that 
seems to get down to “brass tacks,” as it were, and that will wake up 
classes. Their ultimate value in any instance will depend, of course,
upon the skill of the instructor in directing and controlling discussion, 
in clearly identifying and emphasizing principles as they emerge, and 
in seeing that serious gaps are not left in students’ preparation. 


M. H. WILLING.